2014-01-02 07:13:12

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 00/16] Volatile Ranges v10

Hey all,

Happy New Year!

I know it's bad timing to send this unfamiliar large patchset for
review but hope there are some guys with freshed-brain in new year
all over the world. :)
And most important thing is that before I dive into lots of testing,
I'd like to make an agreement on design issues and others

o Syscall interface
o Not bind with vma split/merge logic to prevent mmap_sem cost and
o Not bind with vma split/merge logic to avoid vm_area_struct memory
footprint.
o Purging logic - when we trigger purging volatile pages to prevent
working set and stop to prevent too excessive purging of volatile
pages
o How to test
Currently, we have a patched jemalloc allocator by Jason's help
although it's not perfect and more rooms to be enhanced but IMO,
it's enough to prove vrange-anonymous. The problem is that
lack of benchmark for testing vrange-file side. I hope that
Mozilla folks can help.

So its been a while since the last release of the volatile ranges
patches, again. I and John have been busy with other things.
Still, we have been slowly chipping away at issues and differences
trying to get a patchset that we both agree on.

There's still a few issues, but we figured any further polishing of
the patch series in private would be unproductive and it would be much
better to send the patches out for review and comment and get some wider
opinions.

You could get full patchset by git

git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git

In v10, there are some notable changes following as

Whats new in v10:
* Fix several bugs and build break
* Add shmem_purge_page to correct purging shmem/tmpfs
* Replace slab shrinker with direct hooked reclaim path
* Optimize pte scanning by caching previous place
* Reorder patch and tidy up Cc-list
* Rebased on v3.12
* Add vrange-anon test with jemalloc in Dhaval's test suite
- https://github.com/volatile-ranges-test/vranges-test
so, you could test any application with vrange-patched jemalloc by
LD_PRELOAD but please keep in mind that it's just a prototype to
prove vrange syscall concept so it has more rooms to optimize.
So, please do not compare it with another allocator.

Whats new in v9:
* Updated to v3.11
* Added vrange purging logic to purge anonymous pages on
swapless systems
* Added logic to allocate the vroot structure dynamically
to avoid added overhead to mm and address_space structures
* Lots of minor tweaks, changes and cleanups

Still TODO:
* Sort out better solution for clearing volatility on new mmaps
- Minchan has a different approach here
* Agreement of systemcall interface
* Better discarding trigger policy to prevent working set evction
* Review, Review, Review.. Comment.
* A ton of test

Feedback or thoughts here would be particularly helpful!

Also, thanks to Dhaval for his maintaining and vastly improving
the volatile ranges test suite, which can be found here:
[1] https://github.com/volatile-ranges-test/vranges-test

These patches can also be pulled from git here:
git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9

We'd really welcome any feedback and comments on the patch series.

thanks

========== &< =========

Volatile ranges provides a method for userland to inform the kernel that
a range of memory is safe to discard (ie: can be regenerated) but
userspace may want to try access it in the future. It can be thought of
as similar to MADV_DONTNEED, but that the actual freeing of the memory
is delayed and only done under memory pressure, and the user can try to
cancel the action and be able to quickly access any unpurged pages. The
idea originated from Android's ashmem, but I've since learned that other
OSes provide similar functionality.

This funcitonality allows for a number of interesting uses:
* Userland caches that have kernel triggered eviction under memory
pressure. This allows for the kernel to "rightsize" userspace caches for
current system-wide workload. Things like image bitmap caches, or
rendered HTML in a hidden browser tab, where the data is not visible and
can be regenerated if needed, are good examples.

* Opportunistic freeing of memory that may be quickly reused. Minchan
has done a malloc implementation where free() marks the pages as
volatile, allowing the kernel to reclaim under pressure. This avoids the
unmapping and remapping of anonymous pages on free/malloc. So if
userland wants to malloc memory quickly after the free, it just needs to
mark the pages as non-volatile, and only purged pages will have to be
faulted back in. I did some test with jemalloc by Jason Mason's help who
is author of jemalloc because he had interest on vrange sytem call.

Test(RAM 2G, CPU 4, ebizzy benchmark)
ebizzy argument: ./ebizzy -S 30 -n 512

default chunksize = 512k so 512k * 512 = 256M, *a* ebizzy process
has 256M footprint.

(1.1) stands for 1 process and 1 thread so (1.4) is
1 process and 4 thread.

vanilla patched
1.1 1.1
records:5 records:5
sum:30225 sum:151159
avg:6045 avg:30231.8
std:12.6174482365881 std:145.0839756831
med:6042 med:30281
max:6064 max:30363
min:6026 min:29953
1.4 1.4
records:5 records:5
sum:74882 sum:281708
avg:14976.4 avg:56341.6
std:177.827556919662 std:924.991156714412
med:14990 med:56420
max:15242 max:57398
min:14683 min:54704
1.8 1.8
records:5 records:5
sum:75060 sum:246196
avg:15012 avg:49239.2
std:166.670933278686 std:2072.42248588458
med:14985 med:50622
max:15307 max:50863
min:14790 min:45440
1.16 1.16
records:5 records:5
sum:92251 sum:230435
avg:18450.2 avg:46087
std:121.169963274595 std:735.596356706584
med:18531 med:46339
max:18554 max:46810
min:18242 min:44737
4.1 4.1
records:5 records:5
sum:18832 sum:50573
avg:3766.4 avg:10114.6
std:41.3018159407047 std:100.183032495457
med:3759 med:10184
max:3843 max:10209
min:3724 min:9926
4.4 4.4
records:5 records:5
sum:18748 sum:40348
avg:3749.6 avg:8069.6
std:29.5133867930996 std:80.6091806185631
med:3741 med:8013
max:3803 max:8170
min:3721 min:7993
4.8 4.8
records:5 records:5
sum:18783 sum:40576
avg:3756.6 avg:8115.2
std:34.7770038962723 std:66.3789123141068
med:3747 med:8111
max:3820 max:8196
min:3716 min:8033
4.16 4.16
records:5 records:5
sum:21926 sum:29612
avg:4385.2 avg:5922.4
std:36.4219713909391 std:1486.31189189887
med:4391 med:5123
max:4431 max:8216
min:4319 min:4537

In every case, patched jemallloc allocator is win but as memory pressure
is severe, the gain was reduced but still better.
The stddev is rather higher old. I guess some reasons but need more to
investigate it. Of course, I need more testing on various workloads.
It should be TODO.

The syscall interface is defined in patch [4/16] in this series, but
briefly there are two ways to utilze the functionality:

Explicit marking method:
1) Userland marks a range of memory that can be regenerated if necessary
as volatile
2) Before accessing the memory again, userland marks the memroy as
nonvolatile, and the kernel will provide notifcation if any pages in the
range has been purged.

Optimistic method:
1) Userland marks a large range of data as volatile
2) Userland continues to access the data as it needs.
3) If userland accesses a page that has been purged, the kernel will
send a SIGBUS
4) Userspace can trap the SIGBUS, mark the afected pages as
non-volatile, and refill the data as needed before continuing on

Other details:
The interface takes a range of memory, which can cover anonymous pages
as well as mmapped file pages. In the case that the pages are from a
shared mmapped file, the volatility set on those file pages is global.
Thus much as writes to those pages are shared to other processes, pages
marked volatile will be volatile to any other processes that have the
file mapped as well. It is advised that processes coordinate when using
volatile ranges on shared mappings (much as they must coordinate when
writing to shared data). Any uncleared volatility on mmapped files will
last until the the file is closed by all users (ie: volatility isn't
persistent on disk).

Volatility on anonymous pages are inherited across forks, but cleared on
exec.

You can read more about the history of volatile ranges here:
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges

John Stultz (2):
vrange: Clear volatility on new mmaps
vrange: Add support for volatile ranges on file mappings

Minchan Kim (14):
vrange: Add vrange support to mm_structs
vrange: Add new vrange(2) system call
vrange: Add basic functions to purge volatile pages
vrange: introduce fake VM_VRANGE flag
vrange: Purge volatile pages when memory is tight
vrange: Send SIGBUS when user try to access purged page
vrange: Add core shrinking logic for swapless system
vrange: Purging vrange-anon pages from shrinker
vrange: support shmem_purge_page
vrange: Support background purging for vrange-file
vrange: Allocate vroot dynamically
vrange: Change purged with hint
vrange: Prevent unnecessary scanning
vrange: Add vmstat counter about purged page

arch/x86/syscalls/syscall_64.tbl | 1 +
fs/inode.c | 4 +
include/linux/fs.h | 4 +
include/linux/mm.h | 9 +
include/linux/mm_types.h | 4 +
include/linux/shmem_fs.h | 1 +
include/linux/swap.h | 48 +-
include/linux/syscalls.h | 2 +
include/linux/vm_event_item.h | 6 +
include/linux/vrange.h | 45 +-
include/linux/vrange_types.h | 6 +-
include/uapi/asm-generic/mman-common.h | 3 +
kernel/fork.c | 12 +
kernel/sys_ni.c | 1 +
mm/internal.h | 2 -
mm/memory.c | 35 +-
mm/mincore.c | 5 +-
mm/mmap.c | 5 +
mm/rmap.c | 17 +-
mm/shmem.c | 46 ++
mm/swapfile.c | 37 +
mm/vmscan.c | 72 +-
mm/vmstat.c | 6 +
mm/vrange.c | 1174 +++++++++++++++++++++++++++++++-
24 files changed, 1477 insertions(+), 68 deletions(-)

--
1.7.9.5


2014-01-02 07:13:13

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 01/16] vrange: Add vrange support to mm_structs

This patch adds vroot on mm_struct so process can set volatile
ranges on anonymous memory.

This is somewhat wasteful, as it increases the mm struct even
if the process doesn't use vrange syscall. So a later patch
will provide dynamically allocated vroots.

One of note on this patch is vrange_fork. Since we do allocations
while holding a lock on the vrange, its possible it could deadlock
with direct reclaim's purging logic. For this reason, vrange_fork
uses GFP_NOIO for its allocations.

If vrange_fork fails, it isn't a critical problem. Since the result
is the child process's pages won't be volatile/purgable, which
could cause additional memory pressure, but won't cause problematic
application behavior (since volatile pages are only purged at the
kernels' discretion). This is thought to be more desirable then
having fork fail.

NOTE: Additionally, as a optimization point, we could remove pages
like MADV_DONTNEED instantly when we see the allocation fail.
There would be no point to make new volatile ranges when memory
pressure was already tight.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
[jstultz: Bit of refactoring. Comment cleanups]
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/mm_types.h | 4 ++++
include/linux/vrange.h | 7 ++++++-
kernel/fork.c | 11 +++++++++++
mm/vrange.c | 40 ++++++++++++++++++++++++++++++++++++++++
4 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851eeb6e1d..a4de9cfa8ff1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
+#include <linux/vrange_types.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -350,6 +351,9 @@ struct mm_struct {
*/


+#ifdef CONFIG_MMU
+ struct vrange_root vroot;
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 0d378a5dc8d7..2b96ee1ee75b 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -37,12 +37,17 @@ static inline int vrange_type(struct vrange *vrange)
}

extern void vrange_root_cleanup(struct vrange_root *vroot);
-
+extern int vrange_fork(struct mm_struct *new,
+ struct mm_struct *old);
#else

static inline void vrange_root_init(struct vrange_root *vroot,
int type, void *object) {};
static inline void vrange_root_cleanup(struct vrange_root *vroot) {};
+static inline int vrange_fork(struct mm_struct *new, struct mm_struct *old)
+{
+ return 0;
+}

#endif
#endif /* _LINIUX_VRANGE_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73ad6bd..36d3c4bb4c4d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -71,6 +71,7 @@
#include <linux/signalfd.h>
#include <linux/uprobes.h>
#include <linux/aio.h>
+#include <linux/vrange.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -376,6 +377,14 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
retval = khugepaged_fork(mm, oldmm);
if (retval)
goto out;
+ /*
+ * Note: vrange_fork can fail in the case of ENOMEM, but
+ * this only results in the child not having any active
+ * volatile ranges. This is not harmful. Thus in this case
+ * the child will not see any pages purged unless it remarks
+ * them as volatile.
+ */
+ vrange_fork(mm, oldmm);

prev = NULL;
for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
@@ -535,6 +544,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
mm->nr_ptes = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
+ vrange_root_init(&mm->vroot, VRANGE_MM, mm);
mm_init_aio(mm);
mm_init_owner(mm, p);

@@ -606,6 +616,7 @@ void mmput(struct mm_struct *mm)

if (atomic_dec_and_test(&mm->mm_users)) {
uprobe_clear_state(mm);
+ vrange_root_cleanup(&mm->vroot);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
diff --git a/mm/vrange.c b/mm/vrange.c
index a5daea44e031..57dad4d72b04 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -182,3 +182,43 @@ void vrange_root_cleanup(struct vrange_root *vroot)
vrange_unlock(vroot);
}

+/*
+ * It's okay to fail vrange_fork because worst case is child process
+ * can't have copied own vrange data structure so that pages in the
+ * vrange couldn't be purged. It would be better rather than failing
+ * fork.
+ */
+int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)
+{
+ struct vrange_root *new, *old;
+ struct vrange *range, *new_range;
+ struct rb_node *next;
+
+ new = &new_mm->vroot;
+ old = &old_mm->vroot;
+
+ vrange_lock(old);
+ next = rb_first(&old->v_rb);
+ while (next) {
+ range = vrange_entry(next);
+ next = rb_next(next);
+ /*
+ * We can't use GFP_KERNEL because direct reclaim's
+ * purging logic on vrange could be deadlock by
+ * vrange_lock.
+ */
+ new_range = __vrange_alloc(GFP_NOIO);
+ if (!new_range)
+ goto fail;
+ __vrange_set(new_range, range->node.start,
+ range->node.last, range->purged);
+ __vrange_add(new_range, new);
+
+ }
+ vrange_unlock(old);
+ return 0;
+fail:
+ vrange_unlock(old);
+ vrange_root_cleanup(new);
+ return -ENOMEM;
+}
--
1.7.9.5

2014-01-02 07:13:15

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 02/16] vrange: Clear volatility on new mmaps

From: John Stultz <[email protected]>

At lsf-mm, the issue was brought up that there is a precedence with
interfaces like mlock, such that new mappings in a pre-existing range
do no inherit the mlock state.

This is mostly because mlock only modifies the existing vmas, and so
any new mmaps create new vmas, which won't be mlocked.

Since volatility is not stored in the vma (for good cause, specifically
as we'd have to have manage file volatility differently from anonymous
and we're likely to manage volatility on small chunks of memory, which
would cause lots of vma splitting and churn), this patch clears volitility
on new mappings, to ensure that we don't inherit volatility if memory in
an existing volatile range is unmapped and then re-mapped with something
else.

Thus, this patch forces any volatility to be cleared on mmap.

XXX: We expect this patch to be not well loved by mm folks, and are open
to alternative methods here. Its more of a place holder to address
the issue from lsf-mm and hopefully will spur some further discussion.

Minchan does have an alternative solution[1], but I'm not a big fan of it
yet, so this simpler approach is a placeholder for now.

[1] https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working&id=821f58333b381fd88ee7f37fd9c472949756c74e

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
[minchan: add link alternative solution]
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
include/linux/vrange.h | 2 ++
mm/mmap.c | 5 +++++
mm/vrange.c | 8 ++++++++
3 files changed, 15 insertions(+)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 2b96ee1ee75b..ef153c8a88d1 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -36,6 +36,8 @@ static inline int vrange_type(struct vrange *vrange)
return vrange->owner->type;
}

+extern int vrange_clear(struct vrange_root *vroot,
+ unsigned long start, unsigned long end);
extern void vrange_root_cleanup(struct vrange_root *vroot);
extern int vrange_fork(struct mm_struct *new,
struct mm_struct *old);
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d548512ff8a..b8e2c1e57336 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -36,6 +36,7 @@
#include <linux/sched/sysctl.h>
#include <linux/notifier.h>
#include <linux/memory.h>
+#include <linux/vrange.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1503,6 +1504,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
/* Clear old maps */
error = -ENOMEM;
munmap_back:
+
+ /* zap any volatile ranges */
+ vrange_clear(&mm->vroot, addr, addr + len);
+
if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) {
if (do_munmap(mm, addr, len))
return -ENOMEM;
diff --git a/mm/vrange.c b/mm/vrange.c
index 57dad4d72b04..444da8794dbf 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -167,6 +167,14 @@ static int vrange_remove(struct vrange_root *vroot,
return 0;
}

+int vrange_clear(struct vrange_root *vroot,
+ unsigned long start, unsigned long end)
+{
+ int purged;
+
+ return vrange_remove(vroot, start, end - 1, &purged);
+}
+
void vrange_root_cleanup(struct vrange_root *vroot)
{
struct vrange *range;
--
1.7.9.5

2014-01-02 07:14:05

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 12/16] vrange: Support background purging for vrange-file

Add support to purge vrange file pages.

This is useful, since some filesystems like shmem/tmpfs use anonymous
pages, which won't be aged off the page LRU if swap is disabled.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
[jstultz: Commit message tweaks]
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/vrange.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 49 insertions(+), 8 deletions(-)

diff --git a/mm/vrange.c b/mm/vrange.c
index ed89835bcff4..51875f256592 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -13,6 +13,7 @@
#include <linux/mmu_notifier.h>
#include <linux/mm_inline.h>
#include <linux/migrate.h>
+#include <linux/pagevec.h>
#include <linux/shmem_fs.h>

static struct kmem_cache *vrange_cachep;
@@ -853,24 +854,64 @@ out:
return ret;
}

+static int __discard_vrange_file(struct address_space *mapping,
+ struct vrange *vrange, unsigned long *ret_discard)
+{
+ struct pagevec pvec;
+ pgoff_t index;
+ int i, ret = 0;
+ unsigned long nr_discard = 0;
+ unsigned long start_idx = vrange->node.start;
+ unsigned long end_idx = vrange->node.last;
+ const pgoff_t start = start_idx >> PAGE_CACHE_SHIFT;
+ pgoff_t end = end_idx >> PAGE_CACHE_SHIFT;
+ LIST_HEAD(pagelist);
+
+ pagevec_init(&pvec, 0);
+ index = start;
+ while (index <= end && pagevec_lookup(&pvec, mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ for (i = 0; i < pagevec_count(&pvec); i++) {
+ struct page *page = pvec.pages[i];
+ index = page->index;
+ if (index > end)
+ break;
+ if (isolate_lru_page(page))
+ continue;
+ list_add(&page->lru, &pagelist);
+ inc_zone_page_state(page, NR_ISOLATED_ANON);
+ }
+ pagevec_release(&pvec);
+ cond_resched();
+ index++;
+ }
+
+ if (!list_empty(&pagelist))
+ nr_discard = discard_vrange_pagelist(&pagelist);
+
+ *ret_discard = nr_discard;
+ putback_lru_pages(&pagelist);
+
+ return ret;
+}
+
static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard)
{
int ret = 0;
- struct mm_struct *mm;
struct vrange_root *vroot;
vroot = vrange->owner;

- /* TODO : handle VRANGE_FILE */
- if (vroot->type != VRANGE_MM)
- goto out;
+ if (vroot->type == VRANGE_MM) {
+ struct mm_struct *mm = vroot->object;
+ ret = __discard_vrange_anon(mm, vrange, nr_discard);
+ } else if (vroot->type == VRANGE_FILE) {
+ struct address_space *mapping = vroot->object;
+ ret = __discard_vrange_file(mapping, vrange, nr_discard);
+ }

- mm = vroot->object;
- ret = __discard_vrange_anon(mm, vrange, nr_discard);
-out:
return ret;
}

-
#define VRANGE_SCAN_THRESHOLD (4 << 20)

unsigned long shrink_vrange(enum lru_list lru, struct lruvec *lruvec,
--
1.7.9.5

2014-01-02 07:14:26

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 04/16] vrange: Add new vrange(2) system call

This patch adds new system call sys_vrange.

NAME
vrange - Mark or unmark range of memory as volatile

SYNOPSIS
int vrange(unsigned_long start, size_t length, int mode,
int *purged);

DESCRIPTION
Applications can use vrange(2) to advise the kernel how it should
handle paging I/O in this VM area. The idea is to help the kernel
discard pages of vrange instead of reclaiming when memory pressure
happens. It means kernel doesn't discard any pages of vrange if
there is no memory pressure.

mode:
VRANGE_VOLATILE
hint to kernel so VM can discard in vrange pages when
memory pressure happens.
VRANGE_NONVOLATILE
hint to kernel so VM doesn't discard vrange pages
any more.

If user try to access purged memory without VRANGE_NOVOLATILE call,
he can encounter SIGBUS if the page was discarded by kernel.

purged: Pointer to an integer which will return 1 if
mode == VRANGE_NONVOLATILE and any page in the affected range
was purged. If purged returns zero during a mode ==
VRANGE_NONVOLATILE call, it means all of the pages in the range
are intact.

RETURN VALUE
On success vrange returns the number of bytes marked or unmarked.
Similar to write(), it may return fewer bytes then specified
if it ran into a problem.

If an error is returned, no changes were made.

ERRORS
EINVAL This error can occur for the following reasons:
* The value length is negative or not page size units.
* addr is not page-aligned
* mode not a valid value.

ENOMEM Not enough memory

EFAULT purged pointer is invalid

There were some comment about this interface.
Firstly, it was suggested by part of madvise(2) but there were some reason
to make it hard.

o Why is it hard to make it based on madvise(2) and vm_area_struct?

The madvise syscall logic is based on vma split/merging but vrange
syscall want to avoid it because it requires mmap_sem which is very
coarse-graind lock and it is critical for multi-threaded friendly
userspace allocator and vma split/merge could create lots of
vm_area_struct because we don't want to merge adjacent volatile
ranges so that we would support fine-grained purging without
propagating purging into another volatile ranges.
For exmaple, firefox folks want to make volatile range as page unit
so if we create vm_area_struct per PAGE_SIZE range,
memory footprint will be much bigger.

Cc: Android Kernel Team <[email protected]>
Cc: Robert Love <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: [email protected] <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
arch/x86/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/mman-common.h | 3 +
kernel/sys_ni.c | 1 +
mm/vrange.c | 164 ++++++++++++++++++++++++++++++++
5 files changed, 171 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65dfd14f..dc332bdc3514 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
311 64 process_vm_writev sys_process_vm_writev
312 common kcmp sys_kcmp
313 common finit_module sys_finit_module
+314 common vrange sys_vrange

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7fac04e7ff6e..2c56f954effe 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -847,4 +847,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_vrange(unsigned long start, size_t len, int mode,
+ int __user *purged);
#endif
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 4164529a94f9..9be120b3b33f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -66,4 +66,7 @@
#define MAP_HUGE_SHIFT 26
#define MAP_HUGE_MASK 0x3f

+#define VRANGE_VOLATILE 0 /* unpin pages so VM can discard them */
+#define VRANGE_NONVOLATILE 1 /* pin pages so VM can't discard them */
+
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052284fd..f40070eff8a1 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -175,6 +175,7 @@ cond_syscall(sys_mremap);
cond_syscall(sys_remap_file_pages);
cond_syscall(compat_sys_move_pages);
cond_syscall(compat_sys_migrate_pages);
+cond_syscall(sys_vrange);

/* block-layer dependent */
cond_syscall(sys_bdflush);
diff --git a/mm/vrange.c b/mm/vrange.c
index 444da8794dbf..9ed5610b2e54 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -4,6 +4,8 @@

#include <linux/vrange.h>
#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/mman.h>

static struct kmem_cache *vrange_cachep;

@@ -230,3 +232,165 @@ fail:
vrange_root_cleanup(new);
return -ENOMEM;
}
+
+static inline struct vrange_root *__vma_to_vroot(struct vm_area_struct *vma)
+{
+ struct vrange_root *vroot = NULL;
+
+ if (vma->vm_file && (vma->vm_flags & VM_SHARED))
+ vroot = &vma->vm_file->f_mapping->vroot;
+ else
+ vroot = &vma->vm_mm->vroot;
+ return vroot;
+}
+
+static inline unsigned long __vma_addr_to_index(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ if (vma->vm_file && (vma->vm_flags & VM_SHARED))
+ return (vma->vm_pgoff << PAGE_SHIFT) + addr - vma->vm_start;
+ return addr;
+}
+
+static ssize_t do_vrange(struct mm_struct *mm, unsigned long start_idx,
+ unsigned long end_idx, int mode, int *purged)
+{
+ struct vm_area_struct *vma;
+ unsigned long orig_start = start_idx;
+ ssize_t count = 0, ret = 0;
+
+ down_read(&mm->mmap_sem);
+
+ vma = find_vma(mm, start_idx);
+ for (;;) {
+ struct vrange_root *vroot;
+ unsigned long tmp, vstart_idx, vend_idx;
+
+ if (!vma)
+ goto out;
+
+ if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+ VM_HUGETLB))
+ goto out;
+
+ /* make sure start is at the front of the current vma*/
+ if (start_idx < vma->vm_start) {
+ start_idx = vma->vm_start;
+ if (start_idx > end_idx)
+ goto out;
+ }
+
+ /* bound tmp to closer of vm_end & end */
+ tmp = vma->vm_end - 1;
+ if (end_idx < tmp)
+ tmp = end_idx;
+
+ vroot = __vma_to_vroot(vma);
+ vstart_idx = __vma_addr_to_index(vma, start_idx);
+ vend_idx = __vma_addr_to_index(vma, tmp);
+
+ /* mark or unmark */
+ if (mode == VRANGE_VOLATILE)
+ ret = vrange_add(vroot, vstart_idx, vend_idx);
+ else if (mode == VRANGE_NONVOLATILE)
+ ret = vrange_remove(vroot, vstart_idx, vend_idx,
+ purged);
+
+ if (ret)
+ goto out;
+
+ /* update count to distance covered so far*/
+ count = tmp - orig_start + 1;
+
+ /* move start up to the end of the vma*/
+ start_idx = vma->vm_end;
+ if (start_idx > end_idx)
+ goto out;
+ /* move to the next vma */
+ vma = vma->vm_next;
+ }
+out:
+ up_read(&mm->mmap_sem);
+
+ /* report bytes successfully marked, even if we're exiting on error */
+ if (count)
+ return count;
+
+ return ret;
+}
+
+/*
+ * The vrange(2) system call.
+ *
+ * Applications can use vrange() to advise the kernel how it should
+ * handle paging I/O in this VM area. The idea is to help the kernel
+ * discard pages of vrange instead of swapping out when memory pressure
+ * happens. The information provided is advisory only, and can be safely
+ * disregarded by the kernel if system has enough free memory.
+ *
+ * mode values:
+ * VRANGE_VOLATILE - hint to kernel so VM can discard vrange pages when
+ * memory pressure happens.
+ * VRANGE_NONVOLATILE - Removes any volatile hints previous specified in that
+ * range.
+ *
+ * purged ptr:
+ * Returns 1 if any page in the range being marked nonvolatile has been purged.
+ *
+ * Return values:
+ * On success vrange returns the number of bytes marked or unmarked.
+ * Similar to write(), it may return fewer bytes then specified if
+ * it ran into a problem.
+ *
+ * If an error is returned, no changes were made.
+ *
+ * Errors:
+ * -EINVAL - start len < 0, start is not page-aligned, start is greater
+ * than TASK_SIZE or "mode" is not a valid value.
+ * -ENOMEM - Short of free memory in system for successful system call.
+ * -EFAULT - Purged pointer is invalid.
+ * -ENOSUP - Feature not yet supported.
+ */
+SYSCALL_DEFINE4(vrange, unsigned long, start,
+ size_t, len, int, mode, int __user *, purged)
+{
+ unsigned long end;
+ struct mm_struct *mm = current->mm;
+ ssize_t ret = -EINVAL;
+ int p = 0;
+
+ if (start & ~PAGE_MASK)
+ goto out;
+
+ len &= PAGE_MASK;
+ if (!len)
+ goto out;
+
+ end = start + len;
+ if (end < start)
+ goto out;
+
+ if (start >= TASK_SIZE)
+ goto out;
+
+ if (purged) {
+ /* Test pointer is valid before making any changes */
+ if (put_user(p, purged))
+ return -EFAULT;
+ }
+
+ ret = do_vrange(mm, start, end - 1, mode, &p);
+
+ if (purged) {
+ if (put_user(p, purged)) {
+ /*
+ * This would be bad, since we've modified volatilty
+ * and the change in purged state would be lost.
+ */
+ BUG();
+ }
+ }
+
+out:
+ return ret;
+}
--
1.7.9.5

2014-01-02 07:14:24

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 05/16] vrange: Add basic functions to purge volatile pages

This patch adds discard_vpage and related functions to purge
anonymous and file volatile pages.

It is in preparation for purging volatile pages when memory is tight.
The logic to trigger purge volatile pages will be introduced in the
next patch.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
[jstultz: Reworked to add purging of file pages, commit log tweaks]
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vrange.h | 9 +++
mm/internal.h | 2 -
mm/vrange.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 201 insertions(+), 2 deletions(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index ef153c8a88d1..778902d9cc30 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -41,6 +41,9 @@ extern int vrange_clear(struct vrange_root *vroot,
extern void vrange_root_cleanup(struct vrange_root *vroot);
extern int vrange_fork(struct mm_struct *new,
struct mm_struct *old);
+int discard_vpage(struct page *page);
+bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr);
+
#else

static inline void vrange_root_init(struct vrange_root *vroot,
@@ -51,5 +54,11 @@ static inline int vrange_fork(struct mm_struct *new, struct mm_struct *old)
return 0;
}

+static inline bool vrange_addr_volatile(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ return false;
+}
+static inline int discard_vpage(struct page *page) { return 0 };
#endif
#endif /* _LINIUX_VRANGE_H */
diff --git a/mm/internal.h b/mm/internal.h
index 684f7aa9692a..a4f6495cc930 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -225,10 +225,8 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)

extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern unsigned long vma_address(struct page *page,
struct vm_area_struct *vma);
-#endif
#else /* !CONFIG_MMU */
static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
{
diff --git a/mm/vrange.c b/mm/vrange.c
index 9ed5610b2e54..18afe94d3f13 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -6,6 +6,12 @@
#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/mman.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include "internal.h"
+#include <linux/swap.h>
+#include <linux/mmu_notifier.h>

static struct kmem_cache *vrange_cachep;

@@ -64,6 +70,19 @@ static inline void __vrange_resize(struct vrange *range,
__vrange_add(range, vroot);
}

+static struct vrange *__vrange_find(struct vrange_root *vroot,
+ unsigned long start_idx,
+ unsigned long end_idx)
+{
+ struct vrange *range = NULL;
+ struct interval_tree_node *node;
+
+ node = interval_tree_iter_first(&vroot->v_rb, start_idx, end_idx);
+ if (node)
+ range = vrange_from_node(node);
+ return range;
+}
+
static int vrange_add(struct vrange_root *vroot,
unsigned long start_idx, unsigned long end_idx)
{
@@ -394,3 +413,176 @@ SYSCALL_DEFINE4(vrange, unsigned long, start,
out:
return ret;
}
+
+bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct vrange_root *vroot;
+ unsigned long vstart_idx, vend_idx;
+ bool ret = false;
+
+ vroot = __vma_to_vroot(vma);
+ vstart_idx = __vma_addr_to_index(vma, addr);
+ vend_idx = vstart_idx + PAGE_SIZE - 1;
+
+ vrange_lock(vroot);
+ if (__vrange_find(vroot, vstart_idx, vend_idx))
+ ret = true;
+ vrange_unlock(vroot);
+ return ret;
+}
+
+/* Caller should hold vrange_lock */
+static void do_purge(struct vrange_root *vroot,
+ unsigned long start_idx, unsigned long end_idx)
+{
+ struct vrange *range;
+ struct interval_tree_node *node;
+
+ node = interval_tree_iter_first(&vroot->v_rb, start_idx, end_idx);
+ while (node) {
+ range = container_of(node, struct vrange, node);
+ range->purged = true;
+ node = interval_tree_iter_next(node, start_idx, end_idx);
+ }
+}
+
+static void try_to_discard_one(struct vrange_root *vroot, struct page *page,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ pte_t pteval;
+ spinlock_t *ptl;
+
+ VM_BUG_ON(!PageLocked(page));
+
+ pte = page_check_address(page, mm, addr, &ptl, 0);
+ if (!pte)
+ return;
+
+ BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|VM_HUGETLB));
+
+ flush_cache_page(vma, addr, page_to_pfn(page));
+ pteval = ptep_clear_flush(vma, addr, pte);
+
+ update_hiwater_rss(mm);
+ if (PageAnon(page))
+ dec_mm_counter(mm, MM_ANONPAGES);
+ else
+ dec_mm_counter(mm, MM_FILEPAGES);
+
+ page_remove_rmap(page);
+ page_cache_release(page);
+
+ pte_unmap_unlock(pte, ptl);
+ mmu_notifier_invalidate_page(mm, addr);
+
+ addr = __vma_addr_to_index(vma, addr);
+
+ do_purge(vroot, addr, addr + PAGE_SIZE - 1);
+}
+
+static int try_to_discard_anon_vpage(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ struct anon_vma_chain *avc;
+ pgoff_t pgoff;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ struct vrange_root *vroot;
+
+ unsigned long address;
+
+ anon_vma = page_lock_anon_vma_read(page);
+ if (!anon_vma)
+ return -1;
+
+ pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ /*
+ * During interating the loop, some processes could see a page as
+ * purged while others could see a page as not-purged because we have
+ * no global lock between parent and child for protecting vrange system
+ * call during this loop. But it's not a problem because the page is
+ * not *SHARED* page but *COW* page so parent and child can see other
+ * data anytime. The worst case by this race is a page was purged
+ * but couldn't be discarded so it makes unnecessary page fault but
+ * it wouldn't be severe.
+ */
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+ vma = avc->vma;
+ mm = vma->vm_mm;
+ vroot = &mm->vroot;
+ address = vma_address(page, vma);
+
+ vrange_lock(vroot);
+ if (!__vrange_find(vroot, address, address + PAGE_SIZE - 1)) {
+ vrange_unlock(vroot);
+ continue;
+ }
+
+ try_to_discard_one(vroot, page, vma, address);
+ vrange_unlock(vroot);
+ }
+
+ page_unlock_anon_vma_read(anon_vma);
+ return 0;
+}
+
+static int try_to_discard_file_vpage(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ struct vrange_root *vroot;
+ unsigned long vstart_idx;
+ int ret = 1;
+
+ if (!page->mapping)
+ return ret;
+
+ vroot = &mapping->vroot;
+ vstart_idx = page->index << PAGE_SHIFT;
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ vrange_lock(vroot);
+
+ if (!__vrange_find(vroot, vstart_idx, vstart_idx + PAGE_SIZE - 1))
+ goto out;
+
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long address = vma_address(page, vma);
+ try_to_discard_one(vroot, page, vma, address);
+ }
+
+ VM_BUG_ON(page_mapped(page));
+ ret = 0;
+out:
+ vrange_unlock(vroot);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return ret;
+}
+
+static int try_to_discard_vpage(struct page *page)
+{
+ if (PageAnon(page))
+ return try_to_discard_anon_vpage(page);
+ return try_to_discard_file_vpage(page);
+}
+
+int discard_vpage(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ if (!try_to_discard_vpage(page)) {
+ if (PageSwapCache(page))
+ try_to_free_swap(page);
+
+ if (page_freeze_refs(page, 1)) {
+ unlock_page(page);
+ return 0;
+ }
+ }
+
+ return 1;
+}
--
1.7.9.5

2014-01-02 07:14:22

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 08/16] vrange: Send SIGBUS when user try to access purged page

By vrange(2) semantic, a user should see SIGBUS if they try to
access purged page without marking the memory as non-voaltile
(ie, vrange(...VRANGE_NOVOLATILE)).

This allows for optimistic traversal of volatile pages, without
having to mark them non-volatile first and the SIGBUS allows
applications to trap and fixup the purged range before accessing
them again.

This patch implements it by adding SWP_VRANGE so it consumes one
from MAX_SWAPFILES. It means worst case of MAX_SWAPFILES in 32 bit
is 32 - 2 - 1 - 1 = 28. I think it's still enough for everybody.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dmitry Adamushko <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Mike Hommey <[email protected]>
Cc: Taras Glek <[email protected]>
Cc: Dhaval Giani <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: [email protected] <[email protected]>
Cc: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 6 +++++-
include/linux/vrange.h | 17 ++++++++++++++++-
mm/memory.c | 35 +++++++++++++++++++++++++++++++++--
mm/mincore.c | 5 ++++-
mm/vrange.c | 19 +++++++++++++++++++
5 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..39b3d4c6aec9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -49,6 +49,9 @@ static inline int current_is_kswapd(void)
* actions on faults.
*/

+#define SWP_VRANGE_NUM 1
+#define SWP_VRANGE (MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
+
/*
* NUMA node memory migration support
*/
@@ -71,7 +74,8 @@ static inline int current_is_kswapd(void)
#endif

#define MAX_SWAPFILES \
- ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
+ ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM \
+ - SWP_VRANGE_NUM)

/*
* Magic header for a swap area. The first part of the union is
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 778902d9cc30..d9ce2ec53a34 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -3,6 +3,8 @@

#include <linux/vrange_types.h>
#include <linux/mm.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>

#define vrange_from_node(node_ptr) \
container_of(node_ptr, struct vrange, node)
@@ -12,6 +14,16 @@

#ifdef CONFIG_MMU

+static inline swp_entry_t make_vrange_entry(void)
+{
+ return swp_entry(SWP_VRANGE, 0);
+}
+
+static inline int is_vrange_entry(swp_entry_t entry)
+{
+ return swp_type(entry) == SWP_VRANGE;
+}
+
static inline void vrange_root_init(struct vrange_root *vroot, int type,
void *object)
{
@@ -43,7 +55,8 @@ extern int vrange_fork(struct mm_struct *new,
struct mm_struct *old);
int discard_vpage(struct page *page);
bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr);
-
+extern bool vrange_addr_purged(struct vm_area_struct *vma,
+ unsigned long address);
#else

static inline void vrange_root_init(struct vrange_root *vroot,
@@ -60,5 +73,7 @@ static inline bool vrange_addr_volatile(struct vm_area_struct *vma,
return false;
}
static inline int discard_vpage(struct page *page) { return 0 };
+static inline bool vrange_addr_purged(struct vm_area_struct *vma,
+ unsigned long address);
#endif
#endif /* _LINIUX_VRANGE_H */
diff --git a/mm/memory.c b/mm/memory.c
index d176154c243f..86231180f01f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/string.h>
+#include <linux/vrange.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -807,6 +808,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
if (unlikely(!pte_present(pte))) {
if (!pte_file(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);
+ if (is_vrange_entry(entry))
+ goto out_set_pte;

if (swap_duplicate(entry) < 0)
return entry.val;
@@ -1152,6 +1155,8 @@ again:
print_bad_pte(vma, addr, ptent, NULL);
} else {
swp_entry_t entry = pte_to_swp_entry(ptent);
+ if (is_vrange_entry(entry))
+ goto out;

if (!non_swap_entry(entry))
rss[MM_SWAPENTS]--;
@@ -1168,6 +1173,7 @@ again:
if (unlikely(!free_swap_and_cache(entry)))
print_bad_pte(vma, addr, ptent, NULL);
}
+out:
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, addr != end);

@@ -3695,15 +3701,40 @@ static int handle_pte_fault(struct mm_struct *mm,

entry = *pte;
if (!pte_present(entry)) {
+ swp_entry_t vrange_entry;
+retry:
if (pte_none(entry)) {
if (vma->vm_ops) {
if (likely(vma->vm_ops->fault))
return do_linear_fault(mm, vma, address,
- pte, pmd, flags, entry);
+ pte, pmd, flags, entry);
}
return do_anonymous_page(mm, vma, address,
- pte, pmd, flags);
+ pte, pmd, flags);
+ }
+
+ vrange_entry = pte_to_swp_entry(entry);
+ if (unlikely(is_vrange_entry(vrange_entry))) {
+ if (!vrange_addr_purged(vma, address)) {
+ /*
+ * If the address is not in purged vrange,
+ * It means user already called NOVOLATILE
+ * vrange system call so we shouldn't send
+ * a SIGBUS. Intead, zap it and retry.
+ */
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*pte, entry)))
+ goto unlock;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ ptep_clear_flush(vma, address, pte);
+ pte_unmap_unlock(pte, ptl);
+ goto retry;
+ }
+
+ return VM_FAULT_SIGBUS;
}
+
if (pte_file(entry))
return do_nonlinear_fault(mm, vma, address,
pte, pmd, flags, entry);
diff --git a/mm/mincore.c b/mm/mincore.c
index da2be56a7b8f..e6138048d735 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -15,6 +15,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/hugetlb.h>
+#include <linux/vrange.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -129,7 +130,9 @@ static void mincore_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
} else { /* pte is a swap entry */
swp_entry_t entry = pte_to_swp_entry(pte);

- if (is_migration_entry(entry)) {
+ if (is_vrange_entry(entry)) {
+ *vec = 0;
+ } else if (is_migration_entry(entry)) {
/* migration entries are always uptodate */
*vec = 1;
} else {
diff --git a/mm/vrange.c b/mm/vrange.c
index 18afe94d3f13..f86ed33434d8 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -431,6 +431,24 @@ bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr)
return ret;
}

+bool vrange_addr_purged(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct vrange_root *vroot;
+ struct vrange *range;
+ unsigned long vstart_idx;
+ bool ret = false;
+
+ vroot = __vma_to_vroot(vma);
+ vstart_idx = __vma_addr_to_index(vma, addr);
+
+ vrange_lock(vroot);
+ range = __vrange_find(vroot, vstart_idx, vstart_idx);
+ if (range && range->purged)
+ ret = true;
+ vrange_unlock(vroot);
+ return ret;
+}
+
/* Caller should hold vrange_lock */
static void do_purge(struct vrange_root *vroot,
unsigned long start_idx, unsigned long end_idx)
@@ -474,6 +492,7 @@ static void try_to_discard_one(struct vrange_root *vroot, struct page *page,
page_remove_rmap(page);
page_cache_release(page);

+ set_pte_at(mm, addr, pte, swp_entry_to_pte(make_vrange_entry()));
pte_unmap_unlock(pte, ptl);
mmu_notifier_invalidate_page(mm, addr);

--
1.7.9.5

2014-01-02 07:14:19

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 07/16] vrange: Purge volatile pages when memory is tight

This patch adds purging logic of volatile pages into direct
reclaim path so that if vrange pages are selected as victim,
they could be discarded rather than swapping out.

Direct purging doesn't consider volatile page's recency because it
would be better to free the page rather than swapping out
another working set pages. This makes sense because userspace
specifies "please remove free these pages when memory is tight"
via the vrange syscall.

This however is an in-kernel behavior and the purging logic
could later change. Applications should not assume anything
about the volatile page purging order, much as they shouldn't
assume anything about the page swapout order.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
[jstultz: commit log tweaks]
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/vmscan.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8bff386e65a0..630723812ce3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -43,6 +43,7 @@
#include <linux/sysctl.h>
#include <linux/oom.h>
#include <linux/prefetch.h>
+#include <linux/vrange.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -665,6 +666,7 @@ enum page_references {
PAGEREF_RECLAIM,
PAGEREF_RECLAIM_CLEAN,
PAGEREF_KEEP,
+ PAGEREF_DISCARD,
PAGEREF_ACTIVATE,
};

@@ -685,6 +687,13 @@ static enum page_references page_check_references(struct page *page,
if (vm_flags & VM_LOCKED)
return PAGEREF_RECLAIM;

+ /*
+ * If volatile page is reached on LRU's tail, we discard the
+ * page without considering recycle the page.
+ */
+ if (vm_flags & VM_VRANGE)
+ return PAGEREF_DISCARD;
+
if (referenced_ptes) {
if (PageSwapBacked(page))
return PAGEREF_ACTIVATE;
@@ -912,6 +921,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
switch (references) {
case PAGEREF_ACTIVATE:
goto activate_locked;
+ /*
+ * NOTE: Do not change case ordering.
+ * If you should change, update mapping after discard_vpage
+ * because page->mapping could be NULL if it's purged.
+ */
+ case PAGEREF_DISCARD:
+ if (may_enter_fs && discard_vpage(page) == 0)
+ goto free_it;
case PAGEREF_KEEP:
goto keep_locked;
case PAGEREF_RECLAIM:
--
1.7.9.5

2014-01-02 07:14:17

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 06/16] vrange: introduce fake VM_VRANGE flag

This patch introduce fake VM_VRANGE flag in vma->vm_flags.
Actually, vma->vm_flags doesn't have such flag and it is just
used to detect a page is volatile page or not in page_referenced.

For it, page_referenced's vm_flags argument semantic is changed so that
caller should specify what kinds of flags he has interest.
It could make to avoid unnecessary volatile range lookup in
page_referenced_one.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/mm.h | 9 +++++++++
mm/rmap.c | 17 +++++++++++++----
mm/vmscan.c | 4 ++--
3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55ee8855..3dec30154f96 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -103,6 +103,15 @@ extern unsigned int kobjsize(const void *objp);
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */

/* Used by sys_madvise() */
+/*
+ * VM_VRANGE is rather special. Actually, vma->vm_flags doesn't have such flag.
+ * It is used to identify whether a page put on volatile range or not
+ * by page_referenced. So, if we are lack of new bit in vmflags, we could
+ * replace it with assembling exclusive flags.
+ *
+ * ex) VM_HUGEPAGE|VM_NOHUGEPAGE
+ */
+#define VM_VRANGE 0x00001000
#define VM_SEQ_READ 0x00008000 /* App will access data sequentially */
#define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */

diff --git a/mm/rmap.c b/mm/rmap.c
index fd3ee7a54a13..9220f12deb93 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -57,6 +57,7 @@
#include <linux/migrate.h>
#include <linux/hugetlb.h>
#include <linux/backing-dev.h>
+#include <linux/vrange.h>

#include <asm/tlbflush.h>

@@ -685,7 +686,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (vma->vm_flags & VM_LOCKED) {
spin_unlock(&mm->page_table_lock);
*mapcount = 0; /* break early from loop */
- *vm_flags |= VM_LOCKED;
+ *vm_flags &= VM_LOCKED;
goto out;
}

@@ -708,7 +709,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (vma->vm_flags & VM_LOCKED) {
pte_unmap_unlock(pte, ptl);
*mapcount = 0; /* break early from loop */
- *vm_flags |= VM_LOCKED;
+ *vm_flags &= VM_LOCKED;
goto out;
}

@@ -724,12 +725,18 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
referenced++;
}
pte_unmap_unlock(pte, ptl);
+ if (*vm_flags & VM_VRANGE &&
+ vrange_addr_volatile(vma, address)) {
+ *mapcount = 0; /* break ealry from loop */
+ *vm_flags &= VM_VRANGE;
+ goto out;
+ }
}

(*mapcount)--;

if (referenced)
- *vm_flags |= vma->vm_flags;
+ *vm_flags &= vma->vm_flags;
out:
return referenced;
}
@@ -844,6 +851,9 @@ static int page_referenced_file(struct page *page,
*
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
+ *
+ * NOTE: caller should pass interested flags in vm_flags to collect
+ * vma->vm_flags.
*/
int page_referenced(struct page *page,
int is_locked,
@@ -853,7 +863,6 @@ int page_referenced(struct page *page,
int referenced = 0;
int we_locked = 0;

- *vm_flags = 0;
if (page_mapped(page) && page_rmapping(page)) {
if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
we_locked = trylock_page(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d9cff6..8bff386e65a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -672,7 +672,7 @@ static enum page_references page_check_references(struct page *page,
struct scan_control *sc)
{
int referenced_ptes, referenced_page;
- unsigned long vm_flags;
+ unsigned long vm_flags = VM_EXEC|VM_LOCKED|VM_VRANGE;

referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
&vm_flags);
@@ -1619,7 +1619,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
{
unsigned long nr_taken;
unsigned long nr_scanned;
- unsigned long vm_flags;
LIST_HEAD(l_hold); /* The pages which were snipped off */
LIST_HEAD(l_active);
LIST_HEAD(l_inactive);
@@ -1652,6 +1651,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
spin_unlock_irq(&zone->lru_lock);

while (!list_empty(&l_hold)) {
+ unsigned long vm_flags = VM_EXEC;
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
--
1.7.9.5

2014-01-02 07:14:14

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 16/16] vrange: Add vmstat counter about purged page

Adds some vmstat for analysise vrange working.

[PGDISCARD|PGVSCAN]_[KSWAPD|DIRECT] means purged page/scanning
so we could see effectiveness of vrange.

PGDISCARD_RESCUED means how many of pages we are missing in
core discarding logic of vrange so if it is big in no big memory
pressure, it may have a problem in scanning logic.

PGDISCARD_SAVE_RECLAIM means how many time we avoid reclaim via
discarding volatile pages but not sure how it is exact because
sc->nr_to_reclaim is very high if it were sc->prioirty is low(ie,
high memory pressure) so it it hard to meet the condition.
Maybe I would change the check via zone_watermark_ok.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vm_event_item.h | 6 ++++++
mm/vmscan.c | 8 ++++++--
mm/vmstat.c | 6 ++++++
mm/vrange.c | 14 ++++++++++++++
4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a22add..df0d8e9e0540 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PGVSCAN_KSWAPD,
+ PGVSCAN_DIRECT,
+ PGDISCARD_KSWAPD,
+ PGDISCARD_DIRECT,
+ PGDISCARD_RESCUED, /* rescued from shrink_page_list */
+ PGDISCARD_SAVE_RECLAIM, /* how many save reclaim */
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d8f45af1ab84..c88e48be010b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -886,8 +886,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* because page->mapping could be NULL if it's purged.
*/
case PAGEREF_DISCARD:
- if (may_enter_fs && discard_vpage(page) == 0)
+ if (may_enter_fs && discard_vpage(page) == 0) {
+ count_vm_event(PGDISCARD_RESCUED);
goto free_it;
+ }
case PAGEREF_KEEP:
goto keep_locked;
case PAGEREF_RECLAIM:
@@ -1768,8 +1770,10 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
unsigned long nr_reclaimed;

nr_reclaimed = shrink_vrange(lru, lruvec, sc);
- if (nr_reclaimed >= sc->nr_to_reclaim)
+ if (nr_reclaimed >= sc->nr_to_reclaim) {
+ count_vm_event(PGDISCARD_SAVE_RECLAIM);
return nr_reclaimed;
+ }

if (is_active_lru(lru)) {
if (inactive_list_is_low(lruvec, lru))
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9bb314577911..fa4eea4c5499 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -789,6 +789,12 @@ const char * const vmstat_text[] = {

"pgfault",
"pgmajfault",
+ "pgvscan_kswapd",
+ "pgvscan_direct",
+ "pgdiscard_kswapd",
+ "pgdiscard_direct",
+ "pgdiscard_rescued",
+ "pgdiscard_save_reclaim",

TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
diff --git a/mm/vrange.c b/mm/vrange.c
index 6cdbf6feed26..16de0a085453 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -1223,6 +1223,7 @@ static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard,
{
int ret = 0;
struct vrange_root *vroot;
+ unsigned long total_scan = *scan;
vroot = vrange->owner;

vroot = vrange_get_vroot(vrange);
@@ -1244,6 +1245,19 @@ static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard,
ret = __discard_vrange_file(mapping, vrange, nr_discard, scan);
}

+ if (!ret) {
+ if (current_is_kswapd())
+ count_vm_events(PGDISCARD_KSWAPD, *nr_discard);
+ else
+ count_vm_events(PGDISCARD_DIRECT, *nr_discard);
+ }
+
+ if (current_is_kswapd())
+ count_vm_events(PGVSCAN_KSWAPD,
+ (total_scan - *scan) >> PAGE_SHIFT);
+ else
+ count_vm_events(PGVSCAN_DIRECT,
+ (total_scan - *scan) >> PAGE_SHIFT);
out:
__vroot_put(vroot);
return ret;
--
1.7.9.5

2014-01-02 07:14:13

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 15/16] vrange: Prevent unnecessary scanning

Now, we scan and discard volatile pages per vrange size but vrange
size is virtual address so we couldn't imagine how many of rss be
there. It could make too excessive scanning in reclaim path if
the range is too big but doesn't have rss so that CPU burns out.

Another problem is we always start from vrange's starting address
everytime although many of pages are already purged in previous
iteration so that it ends up CPU buring, too.

This patch keeps previous scan address in vrange's hint variable
so that we could avoid unnecessary scanning in next round.
Even, if we purge all of pages in the range, we could skip the
vrange.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/vrange.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 91 insertions(+), 16 deletions(-)

diff --git a/mm/vrange.c b/mm/vrange.c
index df01c6b084bf..6cdbf6feed26 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -31,6 +31,11 @@ struct vrange_walker {

#define VRANGE_PURGED_MARK 0

+/*
+ * [mark|clear]_purge could invalidate cached address but it's rare
+ * and at the worst case, some address range would be rescan or skip
+ * so it isn't critical for integrity point of view.
+ */
void mark_purge(struct vrange *range)
{
range->hint |= (1 << VRANGE_PURGED_MARK);
@@ -47,9 +52,36 @@ bool vrange_purged(struct vrange *range)
return purged;
}

-static inline unsigned long vrange_size(struct vrange *range)
+void record_scan_addr(struct vrange *range, unsigned long addr)
{
- return range->node.last + 1 - range->node.start;
+ unsigned long old, new, ret;
+
+ BUG_ON(addr & ~PAGE_MASK);
+
+ /*
+ * hint variable is shared by cache address and purged flag.
+ * purged flag is modified while we hold vrange_lock but
+ * cache address is modified without any lock so that it
+ * could invalidate purged flag by racing do_purge, which
+ * is critical. The cmpxchg should prevent it.
+ */
+ do {
+ old = range->hint;
+ new = old | addr;
+ ret = cmpxchg(&range->hint, old, new);
+ } while (ret != old);
+
+ BUG_ON(addr && addr > range->node.last + 1);
+ BUG_ON(addr && addr < range->node.start);
+}
+
+unsigned long load_scan_addr(struct vrange *range)
+{
+ unsigned long cached_addr = range->hint & PAGE_MASK;
+ BUG_ON(cached_addr && cached_addr > range->node.last + 1);
+ BUG_ON(cached_addr && cached_addr < range->node.start);
+
+ return cached_addr;
}

static void vroot_ctor(void *data)
@@ -259,6 +291,14 @@ static inline void __vrange_lru_add(struct vrange *range)
spin_unlock(&vrange_list.lock);
}

+static inline void __vrange_lru_add_tail(struct vrange *range)
+{
+ spin_lock(&vrange_list.lock);
+ WARN_ON(!list_empty(&range->lru));
+ list_add_tail(&range->lru, &vrange_list.list);
+ spin_unlock(&vrange_list.lock);
+}
+
static inline void __vrange_lru_del(struct vrange *range)
{
spin_lock(&vrange_list.lock);
@@ -306,6 +346,9 @@ static inline void __vrange_set(struct vrange *range,
{
range->node.start = start_idx;
range->node.last = end_idx;
+
+ /* If resize happens, invalidate cache addr */
+ range->hint = 0;
if (purged)
mark_purge(range);
else
@@ -1069,12 +1112,13 @@ static unsigned long discard_vma_pages(struct mm_struct *mm,
* so avoid touching vrange->owner.
*/
static int __discard_vrange_anon(struct mm_struct *mm, struct vrange *vrange,
- unsigned long *ret_discard)
+ unsigned long *ret_discard, unsigned long *scan)
{
struct vm_area_struct *vma;
unsigned long nr_discard = 0;
unsigned long start = vrange->node.start;
unsigned long end = vrange->node.last + 1;
+ unsigned long cached_addr;
int ret = 0;

/* It prevent to destroy vma when the process exist */
@@ -1087,6 +1131,10 @@ static int __discard_vrange_anon(struct mm_struct *mm, struct vrange *vrange,
goto out; /* this vrange could be retried */
}

+ cached_addr = load_scan_addr(vrange);
+ if (cached_addr)
+ start = cached_addr;
+
vma = find_vma(mm, start);
if (!vma || (vma->vm_start >= end))
goto out_unlock;
@@ -1097,10 +1145,18 @@ static int __discard_vrange_anon(struct mm_struct *mm, struct vrange *vrange,
BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
VM_HUGETLB));
cond_resched();
- nr_discard += discard_vma_pages(mm, vma,
- max_t(unsigned long, start, vma->vm_start),
- min_t(unsigned long, end, vma->vm_end));
+
+ start = max(start, vma->vm_start);
+ end = min(end, vma->vm_end);
+ end = min(start + *scan, end);
+
+ nr_discard += discard_vma_pages(mm, vma, start, end);
+ *scan -= (end - start);
+ if (!*scan)
+ break;
}
+
+ record_scan_addr(vrange, end);
out_unlock:
up_read(&mm->mmap_sem);
mmput(mm);
@@ -1110,18 +1166,27 @@ out:
}

static int __discard_vrange_file(struct address_space *mapping,
- struct vrange *vrange, unsigned long *ret_discard)
+ struct vrange *vrange, unsigned long *ret_discard,
+ unsigned long *scan)
{
struct pagevec pvec;
pgoff_t index;
int i, ret = 0;
+ unsigned long cached_addr;
unsigned long nr_discard = 0;
unsigned long start_idx = vrange->node.start;
unsigned long end_idx = vrange->node.last;
const pgoff_t start = start_idx >> PAGE_CACHE_SHIFT;
- pgoff_t end = end_idx >> PAGE_CACHE_SHIFT;
+ pgoff_t end;
LIST_HEAD(pagelist);

+ cached_addr = load_scan_addr(vrange);
+ if (cached_addr)
+ start_idx = cached_addr;
+
+ end_idx = min(start_idx + *scan, end_idx);
+ end = end_idx >> PAGE_CACHE_SHIFT;
+
pagevec_init(&pvec, 0);
index = start;
while (index <= end && pagevec_lookup(&pvec, mapping, index,
@@ -1141,16 +1206,20 @@ static int __discard_vrange_file(struct address_space *mapping,
index++;
}

+ *scan -= (end_idx + 1 - start_idx);
+
if (!list_empty(&pagelist))
nr_discard = discard_vrange_pagelist(&pagelist);

+ record_scan_addr(vrange, end_idx + 1);
*ret_discard = nr_discard;
putback_lru_pages(&pagelist);

return ret;
}

-static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard)
+static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard,
+ unsigned long *scan)
{
int ret = 0;
struct vrange_root *vroot;
@@ -1169,10 +1238,10 @@ static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard)

if (vroot->type == VRANGE_MM) {
struct mm_struct *mm = vroot->object;
- ret = __discard_vrange_anon(mm, vrange, nr_discard);
+ ret = __discard_vrange_anon(mm, vrange, nr_discard, scan);
} else if (vroot->type == VRANGE_FILE) {
struct address_space *mapping = vroot->object;
- ret = __discard_vrange_file(mapping, vrange, nr_discard);
+ ret = __discard_vrange_file(mapping, vrange, nr_discard, scan);
}

out:
@@ -1188,7 +1257,7 @@ unsigned long shrink_vrange(enum lru_list lru, struct lruvec *lruvec,
int retry = 10;
struct vrange *range;
unsigned long nr_to_reclaim, total_reclaimed = 0;
- unsigned long long scan_threshold = VRANGE_SCAN_THRESHOLD;
+ unsigned long remained_scan = VRANGE_SCAN_THRESHOLD;

if (!(sc->gfp_mask & __GFP_IO))
return 0;
@@ -1209,7 +1278,7 @@ unsigned long shrink_vrange(enum lru_list lru, struct lruvec *lruvec,

nr_to_reclaim = sc->nr_to_reclaim;

- while (nr_to_reclaim > 0 && scan_threshold > 0 && retry) {
+ while (nr_to_reclaim > 0 && remained_scan > 0 && retry) {
unsigned long nr_reclaimed = 0;
int ret;

@@ -1224,9 +1293,7 @@ unsigned long shrink_vrange(enum lru_list lru, struct lruvec *lruvec,
continue;
}

- ret = discard_vrange(range, &nr_reclaimed);
- scan_threshold -= vrange_size(range);
-
+ ret = discard_vrange(range, &nr_reclaimed, &remained_scan);
/* If it's EAGAIN, retry it after a little */
if (ret == -EAGAIN) {
retry--;
@@ -1235,6 +1302,14 @@ unsigned long shrink_vrange(enum lru_list lru, struct lruvec *lruvec,
continue;
}

+ if (load_scan_addr(range) < range->node.last) {
+ /*
+ * We like full range purging of a range rather than
+ * partial range purging of all ranges for fairness.
+ */
+ __vrange_lru_add_tail(range);
+ }
+
__vrange_put(range);
retry = 10;

--
1.7.9.5

2014-01-02 07:14:10

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 13/16] vrange: Allocate vroot dynamically

This patch allocates vroot dynamically when vrange syscall is called
so if anybody doesn't call vrange syscall, we don't waste memory space
occupied by vroot.

The vroot is allocated by SLAB_DESTROY_BY_RCU, thus because we can't
guarantee vroot's validity when we are about to access vroot of
a different process, the rules are as follows:

1. rcu_read_lock
2. checkt vroot == NULL
3. increment vroot's refcount
4. rcu_read_unlock
5. vrange_lock(vroot)
6. get vrange from tree
7. vrange->owenr == vroot check again because vroot can be allocated
for another one in same RCU period.

If we're accessing the vroot from our own context, we can skip
the rcu & extra checking, since we know the vroot won't disappear
from under us while we're running.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
[jstultz: Commit rewording, renamed functions, added helper functions]
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
fs/inode.c | 4 +-
include/linux/fs.h | 2 +-
include/linux/mm_types.h | 2 +-
include/linux/vrange.h | 2 -
include/linux/vrange_types.h | 1 +
kernel/fork.c | 5 +-
mm/mmap.c | 2 +-
mm/vrange.c | 267 ++++++++++++++++++++++++++++++++++++++++--
8 files changed, 266 insertions(+), 19 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index b029472134ea..2f0f878be213 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -354,7 +354,6 @@ void address_space_init_once(struct address_space *mapping)
spin_lock_init(&mapping->private_lock);
mapping->i_mmap = RB_ROOT;
INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
- vrange_root_init(&mapping->vroot, VRANGE_FILE, mapping);
}
EXPORT_SYMBOL(address_space_init_once);

@@ -1390,7 +1389,8 @@ static void iput_final(struct inode *inode)
inode_lru_list_del(inode);
spin_unlock(&inode->i_lock);

- vrange_root_cleanup(&inode->i_mapping->vroot);
+ vrange_root_cleanup(inode->i_mapping->vroot);
+ inode->i_mapping->vroot = NULL;

evict(inode);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 19b70288e219..a01fb319499b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -416,7 +416,7 @@ struct address_space {
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
struct mutex i_mmap_mutex; /* protect tree, count, list */
#ifdef CONFIG_MMU
- struct vrange_root vroot;
+ struct vrange_root *vroot;
#endif
/* Protected by tree_lock together with the radix tree */
unsigned long nrpages; /* number of total pages */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a4de9cfa8ff1..a46f565341a1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -352,7 +352,7 @@ struct mm_struct {


#ifdef CONFIG_MMU
- struct vrange_root vroot;
+ struct vrange_root *vroot;
#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index eba155a0263c..d69262edf986 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -30,8 +30,6 @@ static inline void vrange_root_init(struct vrange_root *vroot, int type,
void *object)
{
vroot->type = type;
- vroot->v_rb = RB_ROOT;
- mutex_init(&vroot->v_lock);
vroot->object = object;
}

diff --git a/include/linux/vrange_types.h b/include/linux/vrange_types.h
index d7d451cd50b6..c4ef8b69a0a1 100644
--- a/include/linux/vrange_types.h
+++ b/include/linux/vrange_types.h
@@ -14,6 +14,7 @@ struct vrange_root {
struct mutex v_lock; /* Protect v_rb */
enum vrange_type type; /* range root type */
void *object; /* pointer to mm_struct or mapping */
+ atomic_t refcount;
};

struct vrange {
diff --git a/kernel/fork.c b/kernel/fork.c
index 36d3c4bb4c4d..81960d6b01b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -542,9 +542,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
mm->nr_ptes = 0;
+ mm->vroot = NULL;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
- vrange_root_init(&mm->vroot, VRANGE_MM, mm);
mm_init_aio(mm);
mm_init_owner(mm, p);

@@ -616,7 +616,8 @@ void mmput(struct mm_struct *mm)

if (atomic_dec_and_test(&mm->mm_users)) {
uprobe_clear_state(mm);
- vrange_root_cleanup(&mm->vroot);
+ vrange_root_cleanup(mm->vroot);
+ mm->vroot = NULL;
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
diff --git a/mm/mmap.c b/mm/mmap.c
index b8e2c1e57336..115698d53f7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1506,7 +1506,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
munmap_back:

/* zap any volatile ranges */
- vrange_clear(&mm->vroot, addr, addr + len);
+ vrange_clear(mm->vroot, addr, addr + len);

if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent)) {
if (do_munmap(mm, addr, len))
diff --git a/mm/vrange.c b/mm/vrange.c
index 51875f256592..4e0775b722af 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -17,6 +17,7 @@
#include <linux/shmem_fs.h>

static struct kmem_cache *vrange_cachep;
+static struct kmem_cache *vroot_cachep;

static struct vrange_list {
struct list_head list;
@@ -33,16 +34,182 @@ static inline unsigned long vrange_size(struct vrange *range)
return range->node.last + 1 - range->node.start;
}

+static void vroot_ctor(void *data)
+{
+ struct vrange_root *vroot = data;
+
+ atomic_set(&vroot->refcount, 0);
+ mutex_init(&vroot->v_lock);
+ vroot->v_rb = RB_ROOT;
+}
+
static int __init vrange_init(void)
{
INIT_LIST_HEAD(&vrange_list.list);
spin_lock_init(&vrange_list.lock);

+ vroot_cachep = kmem_cache_create("vrange_root",
+ sizeof(struct vrange_root), 0,
+ SLAB_DESTROY_BY_RCU|SLAB_PANIC, vroot_ctor);
vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
return 0;
}
module_init(vrange_init);

+static struct vrange_root *__vroot_alloc(gfp_t flags)
+{
+ struct vrange_root *vroot = kmem_cache_alloc(vroot_cachep, flags);
+ if (!vroot)
+ return vroot;
+
+ atomic_set(&vroot->refcount, 1);
+ return vroot;
+}
+
+static inline int __vroot_get(struct vrange_root *vroot)
+{
+ if (!atomic_inc_not_zero(&vroot->refcount))
+ return 0;
+
+ return 1;
+}
+
+static inline void __vroot_put(struct vrange_root *vroot)
+{
+ if (atomic_dec_and_test(&vroot->refcount)) {
+ enum {VRANGE_MM, VRANGE_FILE} type = vroot->type;
+ if (type == VRANGE_MM) {
+ struct mm_struct *mm = vroot->object;
+ mmdrop(mm);
+ } else if (type == VRANGE_FILE) {
+ /* TODO : */
+ } else
+ BUG();
+
+ WARN_ON(!RB_EMPTY_ROOT(&vroot->v_rb));
+ kmem_cache_free(vroot_cachep, vroot);
+ }
+}
+
+static bool __vroot_init_mm(struct vrange_root *vroot, struct mm_struct *mm)
+{
+ bool ret = false;
+
+ spin_lock(&mm->page_table_lock);
+ if (!mm->vroot) {
+ mm->vroot = vroot;
+ vrange_root_init(mm->vroot, VRANGE_MM, mm);
+ atomic_inc(&mm->mm_count);
+ ret = true;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ return ret;
+}
+
+static bool __vroot_init_mapping(struct vrange_root *vroot,
+ struct address_space *mapping)
+{
+ bool ret = false;
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ if (!mapping->vroot) {
+ mapping->vroot = vroot;
+ vrange_root_init(mapping->vroot, VRANGE_FILE, mapping);
+ /* XXX - inc ref count on mapping? */
+ ret = true;
+ }
+ mutex_unlock(&mapping->i_mmap_mutex);
+
+ return ret;
+}
+
+static struct vrange_root *vroot_alloc_mm(struct mm_struct *mm)
+{
+ struct vrange_root *ret, *allocated;
+
+ ret = NULL;
+ allocated = __vroot_alloc(GFP_NOFS);
+ if (!allocated)
+ return NULL;
+
+ if (__vroot_init_mm(allocated, mm)) {
+ ret = allocated;
+ allocated = NULL;
+ }
+
+ if (allocated)
+ __vroot_put(allocated);
+
+ return ret;
+}
+
+static struct vrange_root *vroot_alloc_vma(struct vm_area_struct *vma)
+{
+ struct vrange_root *ret, *allocated;
+ bool val;
+
+ ret = NULL;
+ allocated = __vroot_alloc(GFP_KERNEL);
+ if (!allocated)
+ return NULL;
+
+ if (vma->vm_file && (vma->vm_flags & VM_SHARED))
+ val = __vroot_init_mapping(allocated, vma->vm_file->f_mapping);
+ else
+ val = __vroot_init_mm(allocated, vma->vm_mm);
+
+ if (val) {
+ ret = allocated;
+ allocated = NULL;
+ }
+
+ if (allocated)
+ kmem_cache_free(vroot_cachep, allocated);
+
+ return ret;
+}
+
+static struct vrange_root *vrange_get_vroot(struct vrange *vrange)
+{
+ struct vrange_root *vroot;
+ struct vrange_root *ret = NULL;
+
+ rcu_read_lock();
+ /*
+ * Prevent compiler from re-fetching vrange->owner while others
+ * clears vrange->owner.
+ */
+ vroot = ACCESS_ONCE(vrange->owner);
+ if (!vroot)
+ goto out;
+
+ /*
+ * vroot couldn't be destroyed while we're holding rcu_read_lock
+ * so it's okay to access vroot
+ */
+ if (!__vroot_get(vroot))
+ goto out;
+
+
+ /* If we reach here, vroot is either ours or others because
+ * vroot could be allocated for othres in same RCU period
+ * so we should check it carefully. For free/reallocating
+ * for others, all vranges from vroot->tree should be detached
+ * firstly right before vroot freeing so if we check vrange->owner
+ * isn't NULL, it means vroot is ours.
+ */
+ smp_rmb();
+ if (!vrange->owner) {
+ __vroot_put(vroot);
+ goto out;
+ }
+ ret = vroot;
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
static struct vrange *__vrange_alloc(gfp_t flags)
{
struct vrange *vrange = kmem_cache_alloc(vrange_cachep, flags);
@@ -197,6 +364,9 @@ static int vrange_remove(struct vrange_root *vroot,
struct interval_tree_node *node, *next;
bool used_new = false;

+ if (!vroot)
+ return 0;
+
if (!purged)
return -EINVAL;

@@ -267,6 +437,9 @@ void vrange_root_cleanup(struct vrange_root *vroot)
struct vrange *range;
struct rb_node *node;

+ if (vroot == NULL)
+ return;
+
vrange_lock(vroot);
/* We should remove node by post-order traversal */
while ((node = rb_first(&vroot->v_rb))) {
@@ -275,6 +448,12 @@ void vrange_root_cleanup(struct vrange_root *vroot)
__vrange_put(range);
}
vrange_unlock(vroot);
+ /*
+ * Before removing vroot, we should make sure range-owner
+ * should be NULL. See the smp_rmb of vrange_get_vroot.
+ */
+ smp_wmb();
+ __vroot_put(vroot);
}

/*
@@ -282,6 +461,7 @@ void vrange_root_cleanup(struct vrange_root *vroot)
* can't have copied own vrange data structure so that pages in the
* vrange couldn't be purged. It would be better rather than failing
* fork.
+ * The down_write of both mm->mmap_sem protects mm->vroot race.
*/
int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)
{
@@ -289,8 +469,14 @@ int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)
struct vrange *range, *new_range;
struct rb_node *next;

- new = &new_mm->vroot;
- old = &old_mm->vroot;
+ if (!old_mm->vroot)
+ return 0;
+
+ new = vroot_alloc_mm(new_mm);
+ if (!new)
+ return -ENOMEM;
+
+ old = old_mm->vroot;

vrange_lock(old);
next = rb_first(&old->v_rb);
@@ -311,6 +497,7 @@ int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)

}
vrange_unlock(old);
+
return 0;
fail:
vrange_unlock(old);
@@ -323,9 +510,27 @@ static inline struct vrange_root *__vma_to_vroot(struct vm_area_struct *vma)
struct vrange_root *vroot = NULL;

if (vma->vm_file && (vma->vm_flags & VM_SHARED))
- vroot = &vma->vm_file->f_mapping->vroot;
+ vroot = vma->vm_file->f_mapping->vroot;
else
- vroot = &vma->vm_mm->vroot;
+ vroot = vma->vm_mm->vroot;
+
+ return vroot;
+}
+
+static inline struct vrange_root *__vma_to_vroot_get(struct vm_area_struct *vma)
+{
+ struct vrange_root *vroot = NULL;
+
+ rcu_read_lock();
+ vroot = __vma_to_vroot(vma);
+
+ if (!vroot)
+ goto out;
+
+ if (!__vroot_get(vroot))
+ vroot = NULL;
+out:
+ rcu_read_unlock();
return vroot;
}

@@ -371,6 +576,11 @@ static ssize_t do_vrange(struct mm_struct *mm, unsigned long start_idx,
tmp = end_idx;

vroot = __vma_to_vroot(vma);
+ if (!vroot)
+ vroot = vroot_alloc_vma(vma);
+ if (!vroot)
+ goto out;
+
vstart_idx = __vma_addr_to_index(vma, start_idx);
vend_idx = __vma_addr_to_index(vma, tmp);

@@ -483,17 +693,31 @@ out:
bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr)
{
struct vrange_root *vroot;
+ struct vrange *vrange;
unsigned long vstart_idx, vend_idx;
bool ret = false;

- vroot = __vma_to_vroot(vma);
+ vroot = __vma_to_vroot_get(vma);
+
+ if (!vroot)
+ return ret;
+
vstart_idx = __vma_addr_to_index(vma, addr);
vend_idx = vstart_idx + PAGE_SIZE - 1;

vrange_lock(vroot);
- if (__vrange_find(vroot, vstart_idx, vend_idx))
- ret = true;
+ vrange = __vrange_find(vroot, vstart_idx, vend_idx);
+ if (vrange) {
+ /*
+ * vroot can be allocated for another process in
+ * same period so let's check vroot's stability
+ */
+ if (likely(vroot == vrange->owner))
+ ret = true;
+ }
vrange_unlock(vroot);
+ __vroot_put(vroot);
+
return ret;
}

@@ -505,12 +729,16 @@ bool vrange_addr_purged(struct vm_area_struct *vma, unsigned long addr)
bool ret = false;

vroot = __vma_to_vroot(vma);
+ if (!vroot)
+ return false;
+
vstart_idx = __vma_addr_to_index(vma, addr);

vrange_lock(vroot);
range = __vrange_find(vroot, vstart_idx, vstart_idx);
if (range && range->purged)
ret = true;
+
vrange_unlock(vroot);
return ret;
}
@@ -538,6 +766,7 @@ static void try_to_discard_one(struct vrange_root *vroot, struct page *page,
pte_t pteval;
spinlock_t *ptl;

+ VM_BUG_ON(!vroot);
VM_BUG_ON(!PageLocked(page));

pte = page_check_address(page, mm, addr, &ptl, 0);
@@ -596,9 +825,11 @@ static int try_to_discard_anon_vpage(struct page *page)
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
vma = avc->vma;
mm = vma->vm_mm;
- vroot = &mm->vroot;
- address = vma_address(page, vma);
+ vroot = __vma_to_vroot(vma);
+ if (!vroot)
+ continue;

+ address = vma_address(page, vma);
vrange_lock(vroot);
if (!__vrange_find(vroot, address, address + PAGE_SIZE - 1)) {
vrange_unlock(vroot);
@@ -625,7 +856,10 @@ static int try_to_discard_file_vpage(struct page *page)
if (!page->mapping)
return ret;

- vroot = &mapping->vroot;
+ vroot = mapping->vroot;
+ if (!vroot)
+ return ret;
+
vstart_idx = page->index << PAGE_SHIFT;

mutex_lock(&mapping->i_mmap_mutex);
@@ -901,6 +1135,17 @@ static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard)
struct vrange_root *vroot;
vroot = vrange->owner;

+ vroot = vrange_get_vroot(vrange);
+ if (!vroot)
+ return 0;
+
+ /*
+ * Race of vrange->owner could happens with __vrange_remove
+ * but it's okay because subfunctions will check it again
+ */
+ if (vrange->owner == NULL)
+ goto out;
+
if (vroot->type == VRANGE_MM) {
struct mm_struct *mm = vroot->object;
ret = __discard_vrange_anon(mm, vrange, nr_discard);
@@ -909,6 +1154,8 @@ static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard)
ret = __discard_vrange_file(mapping, vrange, nr_discard);
}

+out:
+ __vroot_put(vroot);
return ret;
}

--
1.7.9.5

2014-01-02 07:14:08

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 14/16] vrange: Change purged with hint

struct vrange has a purged field which is just flag to express
the range was purged or not so what we need is just a bit.
It means it's too bloated.

This patch changes the name with hint so upcoming patch will use
other extra bitfield for other purpose.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vrange_types.h | 3 ++-
mm/vrange.c | 39 ++++++++++++++++++++++++++++++---------
2 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/include/linux/vrange_types.h b/include/linux/vrange_types.h
index c4ef8b69a0a1..d42b0e7d7343 100644
--- a/include/linux/vrange_types.h
+++ b/include/linux/vrange_types.h
@@ -20,7 +20,8 @@ struct vrange_root {
struct vrange {
struct interval_tree_node node;
struct vrange_root *owner;
- int purged;
+ /* purged */
+ unsigned long hint;
struct list_head lru;
atomic_t refcount;
};
diff --git a/mm/vrange.c b/mm/vrange.c
index 4e0775b722af..df01c6b084bf 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -29,6 +29,24 @@ struct vrange_walker {
struct list_head *pagelist;
};

+#define VRANGE_PURGED_MARK 0
+
+void mark_purge(struct vrange *range)
+{
+ range->hint |= (1 << VRANGE_PURGED_MARK);
+}
+
+void clear_purge(struct vrange *range)
+{
+ range->hint &= ~(1 << VRANGE_PURGED_MARK);
+}
+
+bool vrange_purged(struct vrange *range)
+{
+ bool purged = range->hint & (1 << VRANGE_PURGED_MARK);
+ return purged;
+}
+
static inline unsigned long vrange_size(struct vrange *range)
{
return range->node.last + 1 - range->node.start;
@@ -217,7 +235,7 @@ static struct vrange *__vrange_alloc(gfp_t flags)
return vrange;

vrange->owner = NULL;
- vrange->purged = 0;
+ vrange->hint = 0;
INIT_LIST_HEAD(&vrange->lru);
atomic_set(&vrange->refcount, 1);

@@ -288,14 +306,17 @@ static inline void __vrange_set(struct vrange *range,
{
range->node.start = start_idx;
range->node.last = end_idx;
- range->purged = purged;
+ if (purged)
+ mark_purge(range);
+ else
+ clear_purge(range);
}

static inline void __vrange_resize(struct vrange *range,
unsigned long start_idx, unsigned long end_idx)
{
struct vrange_root *vroot = range->owner;
- bool purged = range->purged;
+ bool purged = vrange_purged(range);

__vrange_remove(range);
__vrange_lru_del(range);
@@ -341,7 +362,7 @@ static int vrange_add(struct vrange_root *vroot,

start_idx = min_t(unsigned long, start_idx, node->start);
end_idx = max_t(unsigned long, end_idx, node->last);
- purged |= range->purged;
+ purged |= vrange_purged(range);

__vrange_remove(range);
__vrange_put(range);
@@ -383,7 +404,7 @@ static int vrange_remove(struct vrange_root *vroot,
next = interval_tree_iter_next(node, start_idx, end_idx);
range = vrange_from_node(node);

- *purged |= range->purged;
+ *purged |= vrange_purged(range);

if (start_idx <= node->start && end_idx >= node->last) {
/* argumented range covers the range fully */
@@ -409,7 +430,7 @@ static int vrange_remove(struct vrange_root *vroot,
used_new = true;
__vrange_resize(range, node->start, start_idx - 1);
__vrange_set(new_range, end_idx + 1, last,
- range->purged);
+ vrange_purged(range));
__vrange_add(new_range, vroot);
break;
}
@@ -492,7 +513,7 @@ int vrange_fork(struct mm_struct *new_mm, struct mm_struct *old_mm)
if (!new_range)
goto fail;
__vrange_set(new_range, range->node.start,
- range->node.last, range->purged);
+ range->node.last, vrange_purged(range));
__vrange_add(new_range, new);

}
@@ -736,7 +757,7 @@ bool vrange_addr_purged(struct vm_area_struct *vma, unsigned long addr)

vrange_lock(vroot);
range = __vrange_find(vroot, vstart_idx, vstart_idx);
- if (range && range->purged)
+ if (range && vrange_purged(range))
ret = true;

vrange_unlock(vroot);
@@ -753,7 +774,7 @@ static void do_purge(struct vrange_root *vroot,
node = interval_tree_iter_first(&vroot->v_rb, start_idx, end_idx);
while (node) {
range = container_of(node, struct vrange, node);
- range->purged = true;
+ mark_purge(range);
node = interval_tree_iter_next(node, start_idx, end_idx);
}
}
--
1.7.9.5

2014-01-02 07:14:01

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 10/16] vrange: Purging vrange-anon pages from shrinker

This patch provides the logic to discard anonymous vranges by
generating the page list for the volatile ranges setting the ptes
volatile, and discarding the pages.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: John Stultz <[email protected]>
[jstultz: Code tweaks and commit log rewording]
Signed-off-by: Minchan Kim <[email protected]>
---
mm/vrange.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 183 insertions(+), 1 deletion(-)

diff --git a/mm/vrange.c b/mm/vrange.c
index 4a52b7a05f9a..0fa669c56ab8 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -11,6 +11,8 @@
#include <linux/hugetlb.h>
#include "internal.h"
#include <linux/mmu_notifier.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>

static struct kmem_cache *vrange_cachep;

@@ -19,6 +21,11 @@ static struct vrange_list {
spinlock_t lock;
} vrange_list;

+struct vrange_walker {
+ struct vm_area_struct *vma;
+ struct list_head *pagelist;
+};
+
static inline unsigned long vrange_size(struct vrange *range)
{
return range->node.last + 1 - range->node.start;
@@ -682,11 +689,186 @@ static struct vrange *vrange_isolate(void)
return vrange;
}

-static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard)
+static unsigned long discard_vrange_pagelist(struct list_head *page_list)
+{
+ struct page *page;
+ unsigned int nr_discard = 0;
+ LIST_HEAD(ret_pages);
+ LIST_HEAD(free_pages);
+
+ while (!list_empty(page_list)) {
+ int err;
+ page = list_entry(page_list->prev, struct page, lru);
+ list_del(&page->lru);
+ if (!trylock_page(page)) {
+ list_add(&page->lru, &ret_pages);
+ continue;
+ }
+
+ /*
+ * discard_vpage returns unlocked page if it
+ * is successful
+ */
+ err = discard_vpage(page);
+ if (err) {
+ unlock_page(page);
+ list_add(&page->lru, &ret_pages);
+ continue;
+ }
+
+ ClearPageActive(page);
+ list_add(&page->lru, &free_pages);
+ dec_zone_page_state(page, NR_ISOLATED_ANON);
+ nr_discard++;
+ }
+
+ free_hot_cold_page_list(&free_pages, 1);
+ list_splice(&ret_pages, page_list);
+ return nr_discard;
+}
+
+static void vrange_pte_entry(pte_t pteval, unsigned long address,
+ unsigned ptent_size, struct mm_walk *walk)
{
+ struct page *page;
+ struct vrange_walker *vw = walk->private;
+ struct vm_area_struct *vma = vw->vma;
+ struct list_head *pagelist = vw->pagelist;
+
+ if (pte_none(pteval))
+ return;
+
+ if (!pte_present(pteval))
+ return;
+
+ page = vm_normal_page(vma, address, pteval);
+ if (unlikely(!page))
+ return;
+
+ if (!PageLRU(page) || PageLocked(page))
+ return;
+
+ BUG_ON(PageCompound(page));
+
+ if (isolate_lru_page(page))
+ return;
+
+ list_add(&page->lru, pagelist);
+
+ VM_BUG_ON(page_is_file_cache(page));
+ inc_zone_page_state(page, NR_ISOLATED_ANON);
+}
+
+static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vrange_walker *vw = walk->private;
+ struct vm_area_struct *uninitialized_var(vma);
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ vma = vw->vma;
+ split_huge_page_pmd(vma, addr, pmd);
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ for (; addr != end; pte++, addr += PAGE_SIZE)
+ vrange_pte_entry(*pte, addr, PAGE_SIZE, walk);
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+
return 0;
}

+static unsigned long discard_vma_pages(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ unsigned long ret = 0;
+ LIST_HEAD(pagelist);
+ struct vrange_walker vw;
+ struct mm_walk vrange_walk = {
+ .pmd_entry = vrange_pte_range,
+ .mm = vma->vm_mm,
+ .private = &vw,
+ };
+
+ vw.pagelist = &pagelist;
+ vw.vma = vma;
+
+ walk_page_range(start, end, &vrange_walk);
+
+ if (!list_empty(&pagelist))
+ ret = discard_vrange_pagelist(&pagelist);
+
+ putback_lru_pages(&pagelist);
+ return ret;
+}
+
+/*
+ * vrange->owner isn't stable because caller doesn't hold vrange_lock
+ * so avoid touching vrange->owner.
+ */
+static int __discard_vrange_anon(struct mm_struct *mm, struct vrange *vrange,
+ unsigned long *ret_discard)
+{
+ struct vm_area_struct *vma;
+ unsigned long nr_discard = 0;
+ unsigned long start = vrange->node.start;
+ unsigned long end = vrange->node.last + 1;
+ int ret = 0;
+
+ /* It prevent to destroy vma when the process exist */
+ if (!atomic_inc_not_zero(&mm->mm_users))
+ return ret;
+
+ if (!down_read_trylock(&mm->mmap_sem)) {
+ mmput(mm);
+ ret = -EAGAIN;
+ goto out; /* this vrange could be retried */
+ }
+
+ vma = find_vma(mm, start);
+ if (!vma || (vma->vm_start >= end))
+ goto out_unlock;
+
+ for (; vma; vma = vma->vm_next) {
+ if (vma->vm_start >= end)
+ break;
+ BUG_ON(vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+ VM_HUGETLB));
+ cond_resched();
+ nr_discard += discard_vma_pages(mm, vma,
+ max_t(unsigned long, start, vma->vm_start),
+ min_t(unsigned long, end, vma->vm_end));
+ }
+out_unlock:
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+ *ret_discard = nr_discard;
+out:
+ return ret;
+}
+
+static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard)
+{
+ int ret = 0;
+ struct mm_struct *mm;
+ struct vrange_root *vroot;
+ vroot = vrange->owner;
+
+ /* TODO : handle VRANGE_FILE */
+ if (vroot->type != VRANGE_MM)
+ goto out;
+
+ mm = vroot->object;
+ ret = __discard_vrange_anon(mm, vrange, nr_discard);
+out:
+ return ret;
+}
+
+
#define VRANGE_SCAN_THRESHOLD (4 << 20)

unsigned long shrink_vrange(enum lru_list lru, struct lruvec *lruvec,
--
1.7.9.5

2014-01-02 07:13:59

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 11/16] vrange: support shmem_purge_page

If VM discards volatile page of shmem/tmpfs, it should remove
exceptional swap entry from radix tree as well as page itself.

For it, this patch introduces shmem_purge_page and free_swap_and_
cache_locked which is needed because I don't want to add more
overhead in hot path(ex, zap_pte).

A later patch will use it.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/shmem_fs.h | 1 +
include/linux/swap.h | 1 +
mm/shmem.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
mm/swapfile.c | 37 +++++++++++++++++++++++++++++++++++++
mm/vrange.c | 2 ++
5 files changed, 87 insertions(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 30aa0dc60d75..3df94fe5dfb9 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -53,6 +53,7 @@ extern void shmem_unlock_mapping(struct address_space *mapping);
extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
+extern void shmem_purge_page(struct inode *inode, struct page *page);
extern int shmem_unuse(swp_entry_t entry, struct page *page);

static inline struct page *shmem_read_mapping_page(
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 197a7799b59c..fb9f6d1daf89 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -469,6 +469,7 @@ extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
extern void swapcache_free(swp_entry_t, struct page *page);
+extern int free_swap_and_cache_locked(swp_entry_t);
extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
extern unsigned int count_swap_pages(int, int);
diff --git a/mm/shmem.c b/mm/shmem.c
index 8297623fcaed..e3626f969e0f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -441,6 +441,52 @@ void shmem_unlock_mapping(struct address_space *mapping)
}
}

+void shmem_purge_page(struct inode *inode, struct page *page)
+{
+ struct page *ret_page;
+ struct address_space *mapping = inode->i_mapping;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ pgoff_t indices;
+ long nr_swaps_freed = 0;
+ pgoff_t index = page->index;
+
+ VM_BUG_ON(page_mapped(page));
+ VM_BUG_ON(!PageLocked(page));
+
+ if (!shmem_find_get_pages_and_swap(mapping, index,
+ 1, &ret_page, &indices))
+ return;
+
+ index = indices;
+ mem_cgroup_uncharge_start();
+ if (radix_tree_exceptional_entry(ret_page)) {
+ int error;
+ spin_lock_irq(&mapping->tree_lock);
+ error = shmem_radix_tree_replace(mapping, index,
+ ret_page, NULL);
+ spin_unlock_irq(&mapping->tree_lock);
+ if (!error) {
+ swp_entry_t swap = radix_to_swp_entry(ret_page);
+ free_swap_and_cache_locked(swap);
+ }
+ } else {
+ if (page->mapping == mapping)
+ truncate_inode_page(mapping, ret_page);
+ put_page(ret_page);
+ }
+
+ mem_cgroup_uncharge_end();
+
+ spin_lock(&info->lock);
+ info->swapped -= nr_swaps_freed;
+ shmem_recalc_inode(inode);
+ spin_unlock(&info->lock);
+
+ /* Question: We should update? */
+ inode->i_ctime = inode->i_mtime = CURRENT_TIME;
+}
+EXPORT_SYMBOL_GPL(shmem_purge_page);
+
/*
* Remove range of pages and swap entries from radix tree, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index de7c904e52e5..5b1cb7461e52 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -998,6 +998,43 @@ int free_swap_and_cache(swp_entry_t entry)
return p != NULL;
}

+/*
+ * Same with free_swap_cache but user know in advance that page found
+ * from swapper_spaces is already locked so that we could remove the page
+ * from page cache safely.
+ */
+int free_swap_and_cache_locked(swp_entry_t entry)
+{
+ struct swap_info_struct *p;
+ struct page *page = NULL;
+
+ if (non_swap_entry(entry))
+ return 1;
+
+ p = swap_info_get(entry);
+ if (p) {
+ if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
+ page = find_get_page(swap_address_space(entry),
+ entry.val);
+ }
+ spin_unlock(&p->lock);
+ }
+
+ if (page) {
+ /*
+ * Not mapped elsewhere, or swap space full? Free it!
+ * Also recheck PageSwapCache now page is locked (above).
+ */
+ if (PageSwapCache(page) && !PageWriteback(page) &&
+ (!page_mapped(page) || vm_swap_full())) {
+ delete_from_swap_cache(page);
+ SetPageDirty(page);
+ }
+ page_cache_release(page);
+ }
+ return p != NULL;
+}
+
#ifdef CONFIG_HIBERNATION
/*
* Find the swap type that corresponds to given device (if any).
diff --git a/mm/vrange.c b/mm/vrange.c
index 0fa669c56ab8..ed89835bcff4 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -13,6 +13,7 @@
#include <linux/mmu_notifier.h>
#include <linux/mm_inline.h>
#include <linux/migrate.h>
+#include <linux/shmem_fs.h>

static struct kmem_cache *vrange_cachep;

@@ -638,6 +639,7 @@ static int try_to_discard_file_vpage(struct page *page)
}

VM_BUG_ON(page_mapped(page));
+ shmem_purge_page(mapping->host, page);
ret = 0;
out:
vrange_unlock(vroot);
--
1.7.9.5

2014-01-02 07:13:58

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v10 09/16] vrange: Add core shrinking logic for swapless system

This patch adds the core volatile range shrinking logic needed to
allow volatile range purging to function on swapless systems
because current VM doesn't age anonymous pages in case of swapless
system.

Hook shrinking volatile pages logic into VM's reclaim path directly,
where is shrink_list which is called by kswapd and direct reclaim
everytime.

The issue I'd like to solve is that I'd like to keep volatile pages
if system are full of streaming cache pages so reclaim preference
is following as.

used-once stream -> volatile pages -> working set

For detecting used-once stream pages, I uses simple logic(ie,
DEF_PRIORITY - 2) which is used many place in reclaim path to
detect reclaim pressure but need more testing and we might need
tune konb for that.

Anyway, with it, we can reclaim volatile pages regardless of swap
if memory pressure is tight so that we could avoid out of memory kill
and heavy I/O for swapping out.

This patch does not wire in the specific range purging logic,
but that will be added in the following patches.

Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Johannes Weiner <[email protected]>
[jstultz: Renamed some functions and minor cleanups]
Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 41 +++++++++++
include/linux/vrange.h | 10 +++
include/linux/vrange_types.h | 2 +
mm/vmscan.c | 47 ++-----------
mm/vrange.c | 159 ++++++++++++++++++++++++++++++++++++++++--
5 files changed, 212 insertions(+), 47 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 39b3d4c6aec9..197a7799b59c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -13,6 +13,47 @@
#include <linux/page-flags.h>
#include <asm/page.h>

+struct scan_control {
+ /* Incremented by the number of inactive pages that were scanned */
+ unsigned long nr_scanned;
+
+ /* Number of pages freed so far during a call to shrink_zones() */
+ unsigned long nr_reclaimed;
+
+ /* How many pages shrink_list() should reclaim */
+ unsigned long nr_to_reclaim;
+
+ unsigned long hibernation_mode;
+
+ /* This context's GFP mask */
+ gfp_t gfp_mask;
+
+ int may_writepage;
+
+ /* Can mapped pages be reclaimed? */
+ int may_unmap;
+
+ /* Can pages be swapped as part of reclaim? */
+ int may_swap;
+
+ int order;
+
+ /* Scan (total_size >> priority) pages at once */
+ int priority;
+
+ /*
+ * The memory cgroup that hit its limit and as a result is the
+ * primary target of this reclaim invocation.
+ */
+ struct mem_cgroup *target_mem_cgroup;
+
+ /*
+ * Nodemask of nodes allowed by the caller. If NULL, all nodes
+ * are scanned.
+ */
+ nodemask_t *nodemask;
+};
+
struct notifier_block;

struct bio;
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index d9ce2ec53a34..eba155a0263c 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -12,6 +12,8 @@
#define vrange_entry(ptr) \
container_of(ptr, struct vrange, node.rb)

+struct scan_control;
+
#ifdef CONFIG_MMU

static inline swp_entry_t make_vrange_entry(void)
@@ -57,6 +59,8 @@ int discard_vpage(struct page *page);
bool vrange_addr_volatile(struct vm_area_struct *vma, unsigned long addr);
extern bool vrange_addr_purged(struct vm_area_struct *vma,
unsigned long address);
+extern unsigned long shrink_vrange(enum lru_list lru, struct lruvec *lruvec,
+ struct scan_control *sc);
#else

static inline void vrange_root_init(struct vrange_root *vroot,
@@ -75,5 +79,11 @@ static inline bool vrange_addr_volatile(struct vm_area_struct *vma,
static inline int discard_vpage(struct page *page) { return 0 };
static inline bool vrange_addr_purged(struct vm_area_struct *vma,
unsigned long address);
+static inline unsigned long shrink_vrange(enum lru_list lru,
+ struct lruvec *lruvec, struct scan_control *sc)
+{
+ return 0;
+}
+
#endif
#endif /* _LINIUX_VRANGE_H */
diff --git a/include/linux/vrange_types.h b/include/linux/vrange_types.h
index 0d48b423dae2..d7d451cd50b6 100644
--- a/include/linux/vrange_types.h
+++ b/include/linux/vrange_types.h
@@ -20,6 +20,8 @@ struct vrange {
struct interval_tree_node node;
struct vrange_root *owner;
int purged;
+ struct list_head lru;
+ atomic_t refcount;
};
#endif

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 630723812ce3..d8f45af1ab84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,47 +56,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>

-struct scan_control {
- /* Incremented by the number of inactive pages that were scanned */
- unsigned long nr_scanned;
-
- /* Number of pages freed so far during a call to shrink_zones() */
- unsigned long nr_reclaimed;
-
- /* How many pages shrink_list() should reclaim */
- unsigned long nr_to_reclaim;
-
- unsigned long hibernation_mode;
-
- /* This context's GFP mask */
- gfp_t gfp_mask;
-
- int may_writepage;
-
- /* Can mapped pages be reclaimed? */
- int may_unmap;
-
- /* Can pages be swapped as part of reclaim? */
- int may_swap;
-
- int order;
-
- /* Scan (total_size >> priority) pages at once */
- int priority;
-
- /*
- * The memory cgroup that hit its limit and as a result is the
- * primary target of this reclaim invocation.
- */
- struct mem_cgroup *target_mem_cgroup;
-
- /*
- * Nodemask of nodes allowed by the caller. If NULL, all nodes
- * are scanned.
- */
- nodemask_t *nodemask;
-};
-
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))

#ifdef ARCH_HAS_PREFETCH
@@ -1806,6 +1765,12 @@ static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
+ unsigned long nr_reclaimed;
+
+ nr_reclaimed = shrink_vrange(lru, lruvec, sc);
+ if (nr_reclaimed >= sc->nr_to_reclaim)
+ return nr_reclaimed;
+
if (is_active_lru(lru)) {
if (inactive_list_is_low(lruvec, lru))
shrink_active_list(nr_to_scan, lruvec, sc, lru);
diff --git a/mm/vrange.c b/mm/vrange.c
index f86ed33434d8..4a52b7a05f9a 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -10,13 +10,25 @@
#include <linux/rmap.h>
#include <linux/hugetlb.h>
#include "internal.h"
-#include <linux/swap.h>
#include <linux/mmu_notifier.h>

static struct kmem_cache *vrange_cachep;

+static struct vrange_list {
+ struct list_head list;
+ spinlock_t lock;
+} vrange_list;
+
+static inline unsigned long vrange_size(struct vrange *range)
+{
+ return range->node.last + 1 - range->node.start;
+}
+
static int __init vrange_init(void)
{
+ INIT_LIST_HEAD(&vrange_list.list);
+ spin_lock_init(&vrange_list.lock);
+
vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
return 0;
}
@@ -27,21 +39,65 @@ static struct vrange *__vrange_alloc(gfp_t flags)
struct vrange *vrange = kmem_cache_alloc(vrange_cachep, flags);
if (!vrange)
return vrange;
+
vrange->owner = NULL;
vrange->purged = 0;
+ INIT_LIST_HEAD(&vrange->lru);
+ atomic_set(&vrange->refcount, 1);
+
return vrange;
}

static void __vrange_free(struct vrange *range)
{
WARN_ON(range->owner);
+ WARN_ON(atomic_read(&range->refcount) != 0);
+ WARN_ON(!list_empty(&range->lru));
+
kmem_cache_free(vrange_cachep, range);
}

+static inline void __vrange_lru_add(struct vrange *range)
+{
+ spin_lock(&vrange_list.lock);
+ WARN_ON(!list_empty(&range->lru));
+ list_add(&range->lru, &vrange_list.list);
+ spin_unlock(&vrange_list.lock);
+}
+
+static inline void __vrange_lru_del(struct vrange *range)
+{
+ spin_lock(&vrange_list.lock);
+ if (!list_empty(&range->lru)) {
+ list_del_init(&range->lru);
+ WARN_ON(range->owner);
+ }
+ spin_unlock(&vrange_list.lock);
+}
+
static void __vrange_add(struct vrange *range, struct vrange_root *vroot)
{
range->owner = vroot;
interval_tree_insert(&range->node, &vroot->v_rb);
+
+ WARN_ON(atomic_read(&range->refcount) <= 0);
+ __vrange_lru_add(range);
+}
+
+static inline int __vrange_get(struct vrange *vrange)
+{
+ if (!atomic_inc_not_zero(&vrange->refcount))
+ return 0;
+
+ return 1;
+}
+
+static inline void __vrange_put(struct vrange *range)
+{
+ if (atomic_dec_and_test(&range->refcount)) {
+ __vrange_lru_del(range);
+ __vrange_free(range);
+ }
}

static void __vrange_remove(struct vrange *range)
@@ -66,6 +122,7 @@ static inline void __vrange_resize(struct vrange *range,
bool purged = range->purged;

__vrange_remove(range);
+ __vrange_lru_del(range);
__vrange_set(range, start_idx, end_idx, purged);
__vrange_add(range, vroot);
}
@@ -102,7 +159,7 @@ static int vrange_add(struct vrange_root *vroot,
range = vrange_from_node(node);
/* old range covers new range fully */
if (node->start <= start_idx && node->last >= end_idx) {
- __vrange_free(new_range);
+ __vrange_put(new_range);
goto out;
}

@@ -111,7 +168,7 @@ static int vrange_add(struct vrange_root *vroot,
purged |= range->purged;

__vrange_remove(range);
- __vrange_free(range);
+ __vrange_put(range);

node = next;
}
@@ -152,7 +209,7 @@ static int vrange_remove(struct vrange_root *vroot,
if (start_idx <= node->start && end_idx >= node->last) {
/* argumented range covers the range fully */
__vrange_remove(range);
- __vrange_free(range);
+ __vrange_put(range);
} else if (node->start >= start_idx) {
/*
* Argumented range covers over the left of the
@@ -183,7 +240,7 @@ static int vrange_remove(struct vrange_root *vroot,
vrange_unlock(vroot);

if (!used_new)
- __vrange_free(new_range);
+ __vrange_put(new_range);

return 0;
}
@@ -206,7 +263,7 @@ void vrange_root_cleanup(struct vrange_root *vroot)
while ((node = rb_first(&vroot->v_rb))) {
range = vrange_entry(node);
__vrange_remove(range);
- __vrange_free(range);
+ __vrange_put(range);
}
vrange_unlock(vroot);
}
@@ -605,3 +662,93 @@ int discard_vpage(struct page *page)

return 1;
}
+
+static struct vrange *vrange_isolate(void)
+{
+ struct vrange *vrange = NULL;
+ spin_lock(&vrange_list.lock);
+ while (!list_empty(&vrange_list.list)) {
+ vrange = list_entry(vrange_list.list.prev,
+ struct vrange, lru);
+ list_del_init(&vrange->lru);
+ /* vrange is going to destroy */
+ if (__vrange_get(vrange))
+ break;
+
+ vrange = NULL;
+ }
+
+ spin_unlock(&vrange_list.lock);
+ return vrange;
+}
+
+static int discard_vrange(struct vrange *vrange, unsigned long *nr_discard)
+{
+ return 0;
+}
+
+#define VRANGE_SCAN_THRESHOLD (4 << 20)
+
+unsigned long shrink_vrange(enum lru_list lru, struct lruvec *lruvec,
+ struct scan_control *sc)
+{
+ int retry = 10;
+ struct vrange *range;
+ unsigned long nr_to_reclaim, total_reclaimed = 0;
+ unsigned long long scan_threshold = VRANGE_SCAN_THRESHOLD;
+
+ if (!(sc->gfp_mask & __GFP_IO))
+ return 0;
+ /*
+ * In current implementation, VM discard volatile pages by
+ * following preference.
+ *
+ * stream pages -> volatile pages -> anon pages
+ *
+ * If we have trouble(ie, DEF_PRIORITY - 2) with reclaiming cache
+ * pages, it means remained cache pages is likely being working set
+ * so it would be better to discard volatile pages rather than
+ * evicting working set.
+ */
+ if (lru != LRU_INACTIVE_ANON && lru != LRU_ACTIVE_ANON &&
+ sc->priority >= DEF_PRIORITY - 2)
+ return 0;
+
+ nr_to_reclaim = sc->nr_to_reclaim;
+
+ while (nr_to_reclaim > 0 && scan_threshold > 0 && retry) {
+ unsigned long nr_reclaimed = 0;
+ int ret;
+
+ range = vrange_isolate();
+ /* If there is no more vrange, stop */
+ if (!range)
+ return total_reclaimed;
+
+ /* range is removing */
+ if (!range->owner) {
+ __vrange_put(range);
+ continue;
+ }
+
+ ret = discard_vrange(range, &nr_reclaimed);
+ scan_threshold -= vrange_size(range);
+
+ /* If it's EAGAIN, retry it after a little */
+ if (ret == -EAGAIN) {
+ retry--;
+ __vrange_lru_add(range);
+ __vrange_put(range);
+ continue;
+ }
+
+ __vrange_put(range);
+ retry = 10;
+
+ total_reclaimed += nr_reclaimed;
+ if (total_reclaimed >= nr_to_reclaim)
+ break;
+ }
+
+ return total_reclaimed;
+}
--
1.7.9.5

2014-01-27 22:23:40

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

Hi Minchan,


On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim <[email protected]> wrote:
> Hey all,
>
> Happy New Year!
>
> I know it's bad timing to send this unfamiliar large patchset for
> review but hope there are some guys with freshed-brain in new year
> all over the world. :)
> And most important thing is that before I dive into lots of testing,
> I'd like to make an agreement on design issues and others
>
> o Syscall interface
> o Not bind with vma split/merge logic to prevent mmap_sem cost and
> o Not bind with vma split/merge logic to avoid vm_area_struct memory
> footprint.
> o Purging logic - when we trigger purging volatile pages to prevent
> working set and stop to prevent too excessive purging of volatile
> pages
> o How to test
> Currently, we have a patched jemalloc allocator by Jason's help
> although it's not perfect and more rooms to be enhanced but IMO,
> it's enough to prove vrange-anonymous. The problem is that
> lack of benchmark for testing vrange-file side. I hope that
> Mozilla folks can help.
>
> So its been a while since the last release of the volatile ranges
> patches, again. I and John have been busy with other things.
> Still, we have been slowly chipping away at issues and differences
> trying to get a patchset that we both agree on.
>
> There's still a few issues, but we figured any further polishing of
> the patch series in private would be unproductive and it would be much
> better to send the patches out for review and comment and get some wider
> opinions.
>
> You could get full patchset by git
>
> git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git

Brief comments.

- You should provide jemalloc patch too. Otherwise we cannot
understand what the your mesurement mean.
- Your number only claimed the effectiveness anon vrange, but not file vrange.
- Still, Nobody likes file vrange. At least nobody said explicitly on
the list. I don't ack file vrange part until
I fully convinced Pros/Cons. You need to persuade other MM guys if
you really think anon vrange is not
sufficient. (Maybe LSF is the best place)
- I wrote you need to put a mesurement current implementation vs
VMA-based implementation at several
previous iteration. Because You claimed fast, but no number and you
haven't yet. I guess the reason is
you don't have any access to large machine. If so, I'll offer it.
Plz collaborate with us.

Unfortunately, I'm very busy and I didn't have a chance to review your
latest patch yet. But I'll finish it until
mm summit. And, I'll show you guys how much this patch improve glibc malloc too.

I and glibc folks agreed we push vrange into glibc malloc.

https://sourceware.org/ml/libc-alpha/2013-12/msg00343.html

Even though, I still dislike some aspect of this patch. I'd like to
discuss and make better design decision
with you.

Thanks.


>
> In v10, there are some notable changes following as
>
> Whats new in v10:
> * Fix several bugs and build break
> * Add shmem_purge_page to correct purging shmem/tmpfs
> * Replace slab shrinker with direct hooked reclaim path
> * Optimize pte scanning by caching previous place
> * Reorder patch and tidy up Cc-list
> * Rebased on v3.12
> * Add vrange-anon test with jemalloc in Dhaval's test suite
> - https://github.com/volatile-ranges-test/vranges-test
> so, you could test any application with vrange-patched jemalloc by
> LD_PRELOAD but please keep in mind that it's just a prototype to
> prove vrange syscall concept so it has more rooms to optimize.
> So, please do not compare it with another allocator.
>
> Whats new in v9:
> * Updated to v3.11
> * Added vrange purging logic to purge anonymous pages on
> swapless systems
> * Added logic to allocate the vroot structure dynamically
> to avoid added overhead to mm and address_space structures
> * Lots of minor tweaks, changes and cleanups
>
> Still TODO:
> * Sort out better solution for clearing volatility on new mmaps
> - Minchan has a different approach here
> * Agreement of systemcall interface
> * Better discarding trigger policy to prevent working set evction
> * Review, Review, Review.. Comment.
> * A ton of test
>
> Feedback or thoughts here would be particularly helpful!
>
> Also, thanks to Dhaval for his maintaining and vastly improving
> the volatile ranges test suite, which can be found here:
> [1] https://github.com/volatile-ranges-test/vranges-test
>
> These patches can also be pulled from git here:
> git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9
>
> We'd really welcome any feedback and comments on the patch series.

2014-01-27 22:43:51

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On 01/27/2014 02:23 PM, KOSAKI Motohiro wrote:
> Hi Minchan,
>
>
> On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim <[email protected]> wrote:
>> Hey all,
>>
>> Happy New Year!
>>
>> I know it's bad timing to send this unfamiliar large patchset for
>> review but hope there are some guys with freshed-brain in new year
>> all over the world. :)
>> And most important thing is that before I dive into lots of testing,
>> I'd like to make an agreement on design issues and others
>>
>> o Syscall interface
>> o Not bind with vma split/merge logic to prevent mmap_sem cost and
>> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>> footprint.
>> o Purging logic - when we trigger purging volatile pages to prevent
>> working set and stop to prevent too excessive purging of volatile
>> pages
>> o How to test
>> Currently, we have a patched jemalloc allocator by Jason's help
>> although it's not perfect and more rooms to be enhanced but IMO,
>> it's enough to prove vrange-anonymous. The problem is that
>> lack of benchmark for testing vrange-file side. I hope that
>> Mozilla folks can help.
>>
>> So its been a while since the last release of the volatile ranges
>> patches, again. I and John have been busy with other things.
>> Still, we have been slowly chipping away at issues and differences
>> trying to get a patchset that we both agree on.
>>
>> There's still a few issues, but we figured any further polishing of
>> the patch series in private would be unproductive and it would be much
>> better to send the patches out for review and comment and get some wider
>> opinions.
>>
>> You could get full patchset by git
>>
>> git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
> Brief comments.
>
> - You should provide jemalloc patch too. Otherwise we cannot
> understand what the your mesurement mean.
> - Your number only claimed the effectiveness anon vrange, but not file vrange.
> - Still, Nobody likes file vrange. At least nobody said explicitly on
> the list. I don't ack file vrange part until
> I fully convinced Pros/Cons. You need to persuade other MM guys if
> you really think anon vrange is not
> sufficient. (Maybe LSF is the best place)

I do agree that the semantics for volatile-ranges on files is more
difficult for folks to grasp (and like after doing so). I've almost
gotten to the point (as I've discussed with Minchan privately) where I'm
willing to hold back on volatile-ranges on files in the shrort-term just
to see if it helps to get key mm folks to review and comment the
volatile-ranges on anonymous memory.

That said, I do think volatile ranges on files is an important concept,
and I'd like to make sure we don't design something that can't be used
for files in the future.

Part of the major interest in volatile memory has been from web
browsers. Both Chrome and Firefox are already making use of the
file-based ashmem, where available, in order to have this "discardable
memory" feature.

And while the Mozilla developers don't see file based volatile memory as
critical right now for their needs, I can imagine as they continue to
work on multi-process firefox
(http://billmccloskey.wordpress.com/2013/12/05/multiprocess-firefox/)
for performance and security reasons, the need to have memory volatility
shared between processes will become more important.


thanks
-john

2014-01-28 00:11:14

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

Hey KOSAKI,

On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> Hi Minchan,
>
>
> On Thu, Jan 2, 2014 at 2:12 AM, Minchan Kim <[email protected]> wrote:
> > Hey all,
> >
> > Happy New Year!
> >
> > I know it's bad timing to send this unfamiliar large patchset for
> > review but hope there are some guys with freshed-brain in new year
> > all over the world. :)
> > And most important thing is that before I dive into lots of testing,
> > I'd like to make an agreement on design issues and others
> >
> > o Syscall interface
> > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> > footprint.
> > o Purging logic - when we trigger purging volatile pages to prevent
> > working set and stop to prevent too excessive purging of volatile
> > pages
> > o How to test
> > Currently, we have a patched jemalloc allocator by Jason's help
> > although it's not perfect and more rooms to be enhanced but IMO,
> > it's enough to prove vrange-anonymous. The problem is that
> > lack of benchmark for testing vrange-file side. I hope that
> > Mozilla folks can help.
> >
> > So its been a while since the last release of the volatile ranges
> > patches, again. I and John have been busy with other things.
> > Still, we have been slowly chipping away at issues and differences
> > trying to get a patchset that we both agree on.
> >
> > There's still a few issues, but we figured any further polishing of
> > the patch series in private would be unproductive and it would be much
> > better to send the patches out for review and comment and get some wider
> > opinions.
> >
> > You could get full patchset by git
> >
> > git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
>
> Brief comments.
>
> - You should provide jemalloc patch too. Otherwise we cannot

I did. :) It seems you missed below in this description.
You could see it via following URL in Dhaval's test suite.

https://github.com/volatile-ranges-test/vranges-test/blob/master/0001-Implement-experimental-mvolatile-2-mnovolatile-2-sup.patch

Dhaval: Pz, could you merge patches John sent in your test suite?
I just pinged you.

But KOSAKI, pz, don't focus on jemalloc's implementaion.
It's not how jemalloc uses volatile ranges efficiently but just
one of example how to use volatile ranges.
I think volatile ranges could be really useful for garbage collection
of custom allocators(ex, In-memory DB, JVM, Dalvik, v8) as well as
general allocators.

> understand what the your mesurement mean.

> - Your number only claimed the effectiveness anon vrange, but not file vrange.

Yes. It's really problem as I said.
>From the beginning, John Stultz wanted to promote vrange-file to replace
android's ashmem and when I heard usecase of vrange-file, it does make sense
to me so that's why I'd like to unify them in a same interface.

But the problem is lack of interesting from others and lack of time to
test/evaluate it. I'm not an expert of userspace so actually I need a bit
help from them who require the feature but at a moment,
but I don't know who really want or/and help it.

Even, Android folks didn't have any interest on vrange-file.
So, we might drop vrange-file part in this patchset if it's really headache.
But let's discuss further because still I believe it's valuable feature to
keep instead of dropping.

I want that drop of vrange-file is really last resort to make forward
progress of vrange-anon.

> - Still, Nobody likes file vrange. At least nobody said explicitly on
> the list. I don't ack file vrange part until
> I fully convinced Pros/Cons. You need to persuade other MM guys if
> you really think anon vrange is not
> sufficient. (Maybe LSF is the best place)
> - I wrote you need to put a mesurement current implementation vs
> VMA-based implementation at several
> previous iteration. Because You claimed fast, but no number and you
> haven't yet. I guess the reason is

I did. :) Look at the number.
https://lkml.org/lkml/2013/10/8/63

The point is we need an mmap_sem's readside lock for vma handling(ex,
merge/split) and it's really bottlenect point for ebizzy which another
thread want to malloc(ie, mmap with new chunk requires mmap_sem's
write-side lock).

Additionally, some of user want to handle vrange fine-granularity(ex,
as worst case, PAGE_SIZE) so VMA handling would be really overhead
for us.

> you don't have any access to large machine. If so, I'll offer it.
> Plz collaborate with us.

Yes, Yes, Yes. That's what I want and you're really proper person to
collaborate. Pz, ping me if you're ready. :)

>
> Unfortunately, I'm very busy and I didn't have a chance to review your
> latest patch yet. But I'll finish it until
> mm summit. And, I'll show you guys how much this patch improve glibc malloc too.

Cool! It's really helpful for the work which I believe it's really
helpful feature for the Linux so I never want to drop this feature by just
lack of interesting of other MM guys who are very busy with NUMA/memcg stuff. :(

>
> I and glibc folks agreed we push vrange into glibc malloc.
>
> https://sourceware.org/ml/libc-alpha/2013-12/msg00343.html

Thanks for the info and recenlty ChromeOS people is looking into
volatile ranges so it seems there are so many interesting these days
so it would a good chance to make it work.

>
> Even though, I still dislike some aspect of this patch. I'd like to

That's true I need an many comment from MM commmuity so your input would
be really helpful.

> discuss and make better design decision
> with you.

KOSAKI,
Thanks for the your interest and suggestion for collaborating suggestion.

> Thanks.
>
>
> >
> > In v10, there are some notable changes following as
> >
> > Whats new in v10:
> > * Fix several bugs and build break
> > * Add shmem_purge_page to correct purging shmem/tmpfs
> > * Replace slab shrinker with direct hooked reclaim path
> > * Optimize pte scanning by caching previous place
> > * Reorder patch and tidy up Cc-list
> > * Rebased on v3.12
> > * Add vrange-anon test with jemalloc in Dhaval's test suite
> > - https://github.com/volatile-ranges-test/vranges-test
> > so, you could test any application with vrange-patched jemalloc by
> > LD_PRELOAD but please keep in mind that it's just a prototype to
> > prove vrange syscall concept so it has more rooms to optimize.
> > So, please do not compare it with another allocator.
> >
> > Whats new in v9:
> > * Updated to v3.11
> > * Added vrange purging logic to purge anonymous pages on
> > swapless systems
> > * Added logic to allocate the vroot structure dynamically
> > to avoid added overhead to mm and address_space structures
> > * Lots of minor tweaks, changes and cleanups
> >
> > Still TODO:
> > * Sort out better solution for clearing volatility on new mmaps
> > - Minchan has a different approach here
> > * Agreement of systemcall interface
> > * Better discarding trigger policy to prevent working set evction
> > * Review, Review, Review.. Comment.
> > * A ton of test
> >
> > Feedback or thoughts here would be particularly helpful!
> >
> > Also, thanks to Dhaval for his maintaining and vastly improving
> > the volatile ranges test suite, which can be found here:
> > [1] https://github.com/volatile-ranges-test/vranges-test
> >
> > These patches can also be pulled from git here:
> > git://git.linaro.org/people/jstultz/android-dev.git dev/vrange-v9
> >
> > We'd really welcome any feedback and comments on the patch series.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2014-01-28 00:42:34

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On 01/27/2014 04:12 PM, Minchan Kim wrote:
> On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
>> - Your number only claimed the effectiveness anon vrange, but not file vrange.
> Yes. It's really problem as I said.
> From the beginning, John Stultz wanted to promote vrange-file to replace
> android's ashmem and when I heard usecase of vrange-file, it does make sense
> to me so that's why I'd like to unify them in a same interface.
>
> But the problem is lack of interesting from others and lack of time to
> test/evaluate it. I'm not an expert of userspace so actually I need a bit
> help from them who require the feature but at a moment,
> but I don't know who really want or/and help it.
>
> Even, Android folks didn't have any interest on vrange-file.

Just as a correction here. I really don't think this is the case, as
Android's use definitely relies on file based volatility. It might be
more fair to say there hasn't been very much discussion from Android
developers on the particulars of the file volatility semantics (out
possibly not having any particular objections, or more-likely, being a
bit too busy to follow the all various theoretical tangents we've
discussed).

But I'd not want anyone to get the impression that anonymous-only
volatility would be sufficient for Android's needs.


(And to further clarify here, since this can be confusing...
shmem/tmpfs-only file volatility *would* be sufficient, despite that
technically being anonymous backed memory. The key issue is we need to
be able to share the volatility between processes.)


> So, we might drop vrange-file part in this patchset if it's really headache.
> But let's discuss further because still I believe it's valuable feature to
> keep instead of dropping.

If it helps gets interest in reviewing this, I'm ok with deferring
(tmpfs) file volatility, so folks can get comfortable with anonymous
volatility. But I worry its too critical a feature to ignore.

thanks
-john

2014-01-28 01:01:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On Mon, Jan 27, 2014 at 04:42:27PM -0800, John Stultz wrote:
> On 01/27/2014 04:12 PM, Minchan Kim wrote:
> > On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> >> - Your number only claimed the effectiveness anon vrange, but not file vrange.
> > Yes. It's really problem as I said.
> > From the beginning, John Stultz wanted to promote vrange-file to replace
> > android's ashmem and when I heard usecase of vrange-file, it does make sense
> > to me so that's why I'd like to unify them in a same interface.
> >
> > But the problem is lack of interesting from others and lack of time to
> > test/evaluate it. I'm not an expert of userspace so actually I need a bit
> > help from them who require the feature but at a moment,
> > but I don't know who really want or/and help it.
> >
> > Even, Android folks didn't have any interest on vrange-file.
>
> Just as a correction here. I really don't think this is the case, as
> Android's use definitely relies on file based volatility. It might be
> more fair to say there hasn't been very much discussion from Android
> developers on the particulars of the file volatility semantics (out
> possibly not having any particular objections, or more-likely, being a
> bit too busy to follow the all various theoretical tangents we've
> discussed).
>
> But I'd not want anyone to get the impression that anonymous-only
> volatility would be sufficient for Android's needs.

Right. Thanks for the correction.

>
>
> (And to further clarify here, since this can be confusing...
> shmem/tmpfs-only file volatility *would* be sufficient, despite that
> technically being anonymous backed memory. The key issue is we need to
> be able to share the volatility between processes.)
>
>
> > So, we might drop vrange-file part in this patchset if it's really headache.
> > But let's discuss further because still I believe it's valuable feature to
> > keep instead of dropping.
>
> If it helps gets interest in reviewing this, I'm ok with deferring
> (tmpfs) file volatility, so folks can get comfortable with anonymous
> volatility. But I worry its too critical a feature to ignore.

Yes. I don't want to drop it without more discussion with real user
of it but the problem is it's very hard to find one to have extra time
to discuss it.


>
> thanks
> -john
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2014-01-28 01:21:42

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On Mon, Jan 27, 2014 at 05:09:59PM -0800, Taras Glek wrote:
>
>
> John Stultz wrote:
> >On 01/27/2014 04:12 PM, Minchan Kim wrote:
> >>On Mon, Jan 27, 2014 at 05:23:17PM -0500, KOSAKI Motohiro wrote:
> >>>- Your number only claimed the effectiveness anon vrange, but not file vrange.
> >>Yes. It's really problem as I said.
> >> From the beginning, John Stultz wanted to promote vrange-file to replace
> >>android's ashmem and when I heard usecase of vrange-file, it does make sense
> >>to me so that's why I'd like to unify them in a same interface.
> >>
> >>But the problem is lack of interesting from others and lack of time to
> >>test/evaluate it. I'm not an expert of userspace so actually I need a bit
> >>help from them who require the feature but at a moment,
> >>but I don't know who really want or/and help it.
> >>
> >>Even, Android folks didn't have any interest on vrange-file.
> >
> >Just as a correction here. I really don't think this is the case, as
> >Android's use definitely relies on file based volatility. It might be
> >more fair to say there hasn't been very much discussion from Android
> >developers on the particulars of the file volatility semantics (out
> >possibly not having any particular objections, or more-likely, being a
> >bit too busy to follow the all various theoretical tangents we've
> >discussed).
> >
> >But I'd not want anyone to get the impression that anonymous-only
> >volatility would be sufficient for Android's needs.
> Mozilla is starting to use android's ashmem for discardable memory
> within a single process:
> https://bugzilla.mozilla.org/show_bug.cgi?id=748598 .
>
> Volatile ranges do help with that specific(uncommon?) use of ashmem.

Thanks for the info.

I'd like to ask a question.
Do you prefer fvrange(fd, offset, len) or fadvise(fd, offset, len, advise)
inteface rather than current vrange syscall interface for vrange-file?

Because I think it would remove unnecessary mmap/munmap syscall for vrange
interface as well as out of address space in 32bit machine.

>
> For Mozilla sharing memory across processes via ashmem is not a
> nearterm project. It's something that is likely to require
> significant rework. Process-local discardable memory can be
> retrofited in a more straight-forward fashion.
>
> Taras

--
Kind regards,
Minchan Kim

2014-01-29 00:04:47

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

Hello Minchan,

On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> Hey all,
>
> Happy New Year!
>
> I know it's bad timing to send this unfamiliar large patchset for
> review but hope there are some guys with freshed-brain in new year
> all over the world. :)
> And most important thing is that before I dive into lots of testing,
> I'd like to make an agreement on design issues and others
>
> o Syscall interface

Why do we need another syscall for this? Can't we extend madvise to
take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
in the range was purged?

> o Not bind with vma split/merge logic to prevent mmap_sem cost and
> o Not bind with vma split/merge logic to avoid vm_area_struct memory
> footprint.

VMAs are there to track attributes of memory ranges. Duplicating
large parts of their functionality and co-maintaining both structures
on create, destroy, split, and merge means duplicate code and complex
interactions.

1. You need to define semantics and coordinate what happens when the
vma underlying a volatile range changes.

Either you have to strictly co-maintain both range objects, or you
have weird behavior like volatily outliving a vma and then applying
to a separate vma created in its place.

Userspace won't get this right, and even in the kernel this is
error prone and adds a lot to the complexity of vma management.

2. If page reclaim discards a page from the upper end of a a range,
you mark the whole range as purged. If the user later marks the
lower half of the range as non-volatile, the syscall will report
purged=1 even though all requested pages are still there.

The only way to make these semantics clean is either

a) have vrange() return a range ID so that only full ranges can
later be marked non-volatile, or

b) remember individual page purges so that sub-range changes can
properly report them

I don't like a) much because it's somewhat arbitrarily more
restrictive than madvise, mprotect, mmap/munmap etc. And for b),
the straight-forward solution would be to put purge-cookies into
the page tables to properly report purges in subrange changes, but
that would be even more coordination between vmas, page tables, and
the ad-hoc vranges.

3. Page reclaim usually happens on individual pages until an
allocation can be satisfied, but the shrinker purges entire ranges.

Should it really take out an entire 1G volatile range even though 4
pages would have been enough to satisfy an allocation? Sure, we
assume a range represents an single "object" and userspace would
have to regenerate the whole thing with only one page missing, but
there is still a massive difference in page frees, faults, and
allocations.

There needs to be a *really* good argument why VMAs are not enough for
this purpose. I would really like to see anon volatility implemented
as a VMA attribute, and have regular reclaim decide based on rmap of
individual pages whether it needs to swap or purge. Something like
this:

MADV_VOLATILE:
split vma if necessary
set VM_VOLATILE

MADV_NONVOLATILE:
clear VM_VOLATILE
merge vma if possible
pte walk to check for pmd_purged()/pte_purged()
return any_purged

shrink_page_list():
if PageAnon:
if try_to_purge_anon():
page_lock_anon_vma_read()
anon_vma_interval_tree_foreach:
if vma->vm_flags & VM_VOLATILE:
lock page table
unmap page
set_pmd_purged() / set_pte_purged()
unlock page table
page_lock_anon_vma_read()
...
try to reclaim

> o Purging logic - when we trigger purging volatile pages to prevent
> working set and stop to prevent too excessive purging of volatile
> pages
> o How to test
> Currently, we have a patched jemalloc allocator by Jason's help
> although it's not perfect and more rooms to be enhanced but IMO,
> it's enough to prove vrange-anonymous. The problem is that
> lack of benchmark for testing vrange-file side. I hope that
> Mozilla folks can help.
>
> So its been a while since the last release of the volatile ranges
> patches, again. I and John have been busy with other things.
> Still, we have been slowly chipping away at issues and differences
> trying to get a patchset that we both agree on.
>
> There's still a few issues, but we figured any further polishing of
> the patch series in private would be unproductive and it would be much
> better to send the patches out for review and comment and get some wider
> opinions.
>
> You could get full patchset by git
>
> git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
>
> In v10, there are some notable changes following as
>
> Whats new in v10:
> * Fix several bugs and build break
> * Add shmem_purge_page to correct purging shmem/tmpfs
> * Replace slab shrinker with direct hooked reclaim path
> * Optimize pte scanning by caching previous place
> * Reorder patch and tidy up Cc-list
> * Rebased on v3.12
> * Add vrange-anon test with jemalloc in Dhaval's test suite
> - https://github.com/volatile-ranges-test/vranges-test
> so, you could test any application with vrange-patched jemalloc by
> LD_PRELOAD but please keep in mind that it's just a prototype to
> prove vrange syscall concept so it has more rooms to optimize.
> So, please do not compare it with another allocator.
>
> Whats new in v9:
> * Updated to v3.11
> * Added vrange purging logic to purge anonymous pages on
> swapless systems

We stopped scanning anon on swapless systems because anon needed swap
to be reclaimable. If we can reclaim anon without swap, we have to
start scanning anon again unconditionally. It makes no sense to me to
work around this optimization and implement a separate reclaim logic.

> The syscall interface is defined in patch [4/16] in this series, but
> briefly there are two ways to utilze the functionality:
>
> Explicit marking method:
> 1) Userland marks a range of memory that can be regenerated if necessary
> as volatile
> 2) Before accessing the memory again, userland marks the memroy as
> nonvolatile, and the kernel will provide notifcation if any pages in the
> range has been purged.
>
> Optimistic method:
> 1) Userland marks a large range of data as volatile
> 2) Userland continues to access the data as it needs.
> 3) If userland accesses a page that has been purged, the kernel will
> send a SIGBUS
> 4) Userspace can trap the SIGBUS, mark the afected pages as
> non-volatile, and refill the data as needed before continuing on

What happens if a pointer to volatile memory is passed to a syscall
and the fault happens inside copy_*_user()?

> Other details:
> The interface takes a range of memory, which can cover anonymous pages
> as well as mmapped file pages. In the case that the pages are from a
> shared mmapped file, the volatility set on those file pages is global.
> Thus much as writes to those pages are shared to other processes, pages
> marked volatile will be volatile to any other processes that have the
> file mapped as well. It is advised that processes coordinate when using
> volatile ranges on shared mappings (much as they must coordinate when
> writing to shared data). Any uncleared volatility on mmapped files will
> last until the the file is closed by all users (ie: volatility isn't
> persistent on disk).

Support for file pages are a very big deal and they seem to have had
an impact on many design decisions, but they are only mentioned on a
side note in this email.

The rationale behind volatile anon pages was that they are often used
as caches and that dropping them under pressure and regenerating the
cache contents later on was much faster than swapping.

But pages that are backed by an actual filesystem are "regenerated" by
reading the contents back from disk! What's the point of declaring
them volatile?

Shmem pages are a different story. They might be implemented by a
virtual filesystem, but they behave like anon pages when it comes to
reclaim and repopulation so the same rationale for volatility appies.

But a big aspect of anon volatility is communicating to userspace
whether *content* has been destroyed while in volatile state. Shmem
pages might not necessarily need this. The oft-cited example is the
message passing in a large circular buffer that is unused most of the
time. The sender would mark it non-volatile before writing, and the
receiver would mark it volatile again after reading. The writer can
later reuse any unreclaimed *memory*, but nobody is coming back for
the actual *contents* stored in there. This usecase would be
perfectly fine with an interface that simply clears the dirty bits of
a range of shmem pages (through mmap or fd). The writer would set the
pages non-volatile by dirtying them, whereas the reader would mark
them volatile again by clearing the dirty bits. Reclaim would simply
discard clean pages.

So I'm not convinced that the anon side needs to be that awkward, that
all filesystems need to be supported because of shmem, and that shmem
needs more than an interface to clear dirty bits.

2014-01-29 01:44:05

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On 01/28/2014 04:03 PM, Johannes Weiner wrote:
> Hello Minchan,
>
> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
>> Hey all,
>>
>> Happy New Year!
>>
>> I know it's bad timing to send this unfamiliar large patchset for
>> review but hope there are some guys with freshed-brain in new year
>> all over the world. :)
>> And most important thing is that before I dive into lots of testing,
>> I'd like to make an agreement on design issues and others
>>
>> o Syscall interface
> Why do we need another syscall for this? Can't we extend madvise to
> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> in the range was purged?

So the madvise interface is insufficient to provide the semantics
needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
NONVOLATILE call, we have to atomically unmark the volatility status of
the byte range and provide the purge status, which informs the caller if
any of the data in the specified range was discarded (and thus needs to
be regenerated).

The problem is that by clearing the range, we may need to allocate
memory (possibly by splitting in an existing range segment into two),
which possibly could fail. Unfortunately this could happen after we've
modified the volatile state of part of that range. At this point we
can't just fail, because we've modified state and we also need to return
the purge status of the modified state.

Thus we seem to need a write-like interface, which returns the number of
bytes successfully manipulated. But we also have to return the purge
state, which we currently do via a argument pointer.

hpa suggested to create something like an madvise2 interface which would
provide the needed interface change, but would be a shared interface for
the new flags as well as the old (possibly allowing various flags to be
combined). I'm fine changing it (the interface has changed a number of
times already), but we really haven't seen much in the way of a deeper
review, so the current vrange syscall is mostly a placeholder to
demonstrate the functionality and hopefully spur discussion on the
deeper semantics of how volatile ranges should work.


>> o Not bind with vma split/merge logic to prevent mmap_sem cost and
>> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>> footprint.
> VMAs are there to track attributes of memory ranges. Duplicating
> large parts of their functionality and co-maintaining both structures
> on create, destroy, split, and merge means duplicate code and complex
> interactions.
>
> 1. You need to define semantics and coordinate what happens when the
> vma underlying a volatile range changes.
>
> Either you have to strictly co-maintain both range objects, or you
> have weird behavior like volatily outliving a vma and then applying
> to a separate vma created in its place.

So indeed this is a difficult problem! My initial approach is simply
when any new mapping is made, we clear the volatility of the affected
process memory. Admittedly this has extra overhead and Minchan has an
alternative here (which I'm not totally sold on yet, but may be ok).
I'm almost convinced that for anonymous volatility, storing the
volatility in the vma would be ok, but Minchan is worried about the
performance overhead of the required locking for manipulating the vmas.

For file volatility, this is more complicated, because since the
volatility is shared, the ranges have to be tracked against the
address_space structure, and can't be stored in per-process vmas. So
this is partially why we've kept range trees hanging off of the mm and
address_spaces structures, since it allows the range manipulation logic
to be shared in both cases.


> Userspace won't get this right, and even in the kernel this is
> error prone and adds a lot to the complexity of vma management.
Not sure exactly I understand what you mean by "userspace won't get this
right" ?


>
> 2. If page reclaim discards a page from the upper end of a a range,
> you mark the whole range as purged. If the user later marks the
> lower half of the range as non-volatile, the syscall will report
> purged=1 even though all requested pages are still there.

To me this aspect is a non-ideal but acceptable result of the usage pattern.

Semantically, the hard rule would be we never report non-purged if pages
in a range were purged. Reporting purged when pages technically weren't
is not optimal but acceptable side effect of unmarking a sub-range. And
could be avoided by applications marking and unmarking objects consistently.


> The only way to make these semantics clean is either
>
> a) have vrange() return a range ID so that only full ranges can
> later be marked non-volatile, or
>
> b) remember individual page purges so that sub-range changes can
> properly report them
>
> I don't like a) much because it's somewhat arbitrarily more
> restrictive than madvise, mprotect, mmap/munmap etc.
Agreed on A.

> And for b),
> the straight-forward solution would be to put purge-cookies into
> the page tables to properly report purges in subrange changes, but
> that would be even more coordination between vmas, page tables, and
> the ad-hoc vranges.

And for B this would cause way too much overhead for the mark/unmark
operations, which have to be lightweight.


> 3. Page reclaim usually happens on individual pages until an
> allocation can be satisfied, but the shrinker purges entire ranges.
>
> Should it really take out an entire 1G volatile range even though 4
> pages would have been enough to satisfy an allocation? Sure, we
> assume a range represents an single "object" and userspace would
> have to regenerate the whole thing with only one page missing, but
> there is still a massive difference in page frees, faults, and
> allocations.

So the purging behavior has been a (lightly) contentious item. Some have
argued that if a page from a range is purged, we might as well purge the
entire range before purging any page from another range. This makes the
most sense from the usage model where a range is marked and not touched
until it is unmarked.

However, if the user is utilizing the SIGBUS behavior and continues to
access the volatile range, we would really ideally have only the cold
pages purged so that the application can continue and the system can
manage all the pages via the LRU.

Minchan has proposed having flags to set the volatility mode (_PARTIAL
or _FULL) to allow applications to state their preferred behavior, but I
still favor having global LRU behavior and purging things page by page
via normal reclaim. My opinion is due to the fact that full-range
purging causes purging to be least-recently-marked-volatile rather then
LRU, but if the ranges are not accessed while volatile, LRU should
approximate the least-recently-marked-volatile.

However there are more then just philosophical issues that complicate
things. On swapless systems, we don't age anonymous pages, so we don't
have a hook to purge volatile pages. So in that case we currently have
to use the shrinker and use the full-range purging behavior.

Adding anonymous aging for swapless systems might be able to help here,
but thats likely to be complicated as well. For now the dual approach
Minchan implemented (where the LRU can evict single pages, and the
shrinker can evict total ranges) seems like a reasonable compromise
while the functionality is reviewed.


> There needs to be a *really* good argument why VMAs are not enough for
> this purpose. I would really like to see anon volatility implemented
> as a VMA attribute, and have regular reclaim decide based on rmap of
> individual pages whether it needs to swap or purge. Something like
> this:


The performance aspect is the major issue. With the separate range tree,
the operations are close to O(log(#ranges)), which is really attractive.
Anything that changes it to O(#pages in range) would be problematic.
But I think I could mostly go along with this if it stays O(log(#vmas)),
but will let Minchan detail his objections (which are mostly around the
locking contention). Though a few notes on your proposal...


>
> MADV_VOLATILE:
> split vma if necessary
> set VM_VOLATILE
>
> MADV_NONVOLATILE:
also need a "split vma if necessary" here.
> clear VM_VOLATILE
> merge vma if possible
> pte walk to check for pmd_purged()/pte_purged()
So I think the pte walk would be way too costly. Its probably easier to
have a VM_PURGED or something on the vma that we set when we purge a
page, which would simplify the purge state handling.


> return any_purged
>
> shrink_page_list():
> if PageAnon:
> if try_to_purge_anon():
> page_lock_anon_vma_read()
> anon_vma_interval_tree_foreach:
> if vma->vm_flags & VM_VOLATILE:
> lock page table
> unmap page
> set_pmd_purged() / set_pte_purged()
> unlock page table
> page_lock_anon_vma_read()
> ...
> try to reclaim

Again, we'd have to sort out something here for swapless systems.


The only other issue with the using VMAs that confused me when I looked
at it earlier, was that I thought most vma operations will merge
adjacent vmas if the vma state is the same. Your example doesn't have
this issue since you're checking the pages on non-volatile operations,
but assuming we store the purge state in the vma, we wouldn't want to
merge vmas since that would result in two separate volatile (but
unpurged) ranges being merged, and then the increased likelyhood we'd
consider the entire thing purged.


>
>> o Purging logic - when we trigger purging volatile pages to prevent
>> working set and stop to prevent too excessive purging of volatile
>> pages
>> o How to test
>> Currently, we have a patched jemalloc allocator by Jason's help
>> although it's not perfect and more rooms to be enhanced but IMO,
>> it's enough to prove vrange-anonymous. The problem is that
>> lack of benchmark for testing vrange-file side. I hope that
>> Mozilla folks can help.
>>
>> So its been a while since the last release of the volatile ranges
>> patches, again. I and John have been busy with other things.
>> Still, we have been slowly chipping away at issues and differences
>> trying to get a patchset that we both agree on.
>>
>> There's still a few issues, but we figured any further polishing of
>> the patch series in private would be unproductive and it would be much
>> better to send the patches out for review and comment and get some wider
>> opinions.
>>
>> You could get full patchset by git
>>
>> git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
>>
>> In v10, there are some notable changes following as
>>
>> Whats new in v10:
>> * Fix several bugs and build break
>> * Add shmem_purge_page to correct purging shmem/tmpfs
>> * Replace slab shrinker with direct hooked reclaim path
>> * Optimize pte scanning by caching previous place
>> * Reorder patch and tidy up Cc-list
>> * Rebased on v3.12
>> * Add vrange-anon test with jemalloc in Dhaval's test suite
>> - https://github.com/volatile-ranges-test/vranges-test
>> so, you could test any application with vrange-patched jemalloc by
>> LD_PRELOAD but please keep in mind that it's just a prototype to
>> prove vrange syscall concept so it has more rooms to optimize.
>> So, please do not compare it with another allocator.
>>
>> Whats new in v9:
>> * Updated to v3.11
>> * Added vrange purging logic to purge anonymous pages on
>> swapless systems
> We stopped scanning anon on swapless systems because anon needed swap
> to be reclaimable. If we can reclaim anon without swap, we have to
> start scanning anon again unconditionally. It makes no sense to me to
> work around this optimization and implement a separate reclaim logic.

I'd personally prefer we move to that.. but I'm not sure if the
unconditional scanning would be considered too problematic?



>
>> The syscall interface is defined in patch [4/16] in this series, but
>> briefly there are two ways to utilze the functionality:
>>
>> Explicit marking method:
>> 1) Userland marks a range of memory that can be regenerated if necessary
>> as volatile
>> 2) Before accessing the memory again, userland marks the memroy as
>> nonvolatile, and the kernel will provide notifcation if any pages in the
>> range has been purged.
>>
>> Optimistic method:
>> 1) Userland marks a large range of data as volatile
>> 2) Userland continues to access the data as it needs.
>> 3) If userland accesses a page that has been purged, the kernel will
>> send a SIGBUS
>> 4) Userspace can trap the SIGBUS, mark the afected pages as
>> non-volatile, and refill the data as needed before continuing on
> What happens if a pointer to volatile memory is passed to a syscall
> and the fault happens inside copy_*_user()?

I'll have to look into that detail. Thanks for bringing it up.

I suspect it would be the same as if a pointer to mmapped file was
passed to a syscall and the file was truncated by another processes?



>
>> Other details:
>> The interface takes a range of memory, which can cover anonymous pages
>> as well as mmapped file pages. In the case that the pages are from a
>> shared mmapped file, the volatility set on those file pages is global.
>> Thus much as writes to those pages are shared to other processes, pages
>> marked volatile will be volatile to any other processes that have the
>> file mapped as well. It is advised that processes coordinate when using
>> volatile ranges on shared mappings (much as they must coordinate when
>> writing to shared data). Any uncleared volatility on mmapped files will
>> last until the the file is closed by all users (ie: volatility isn't
>> persistent on disk).
> Support for file pages are a very big deal and they seem to have had
> an impact on many design decisions, but they are only mentioned on a
> side note in this email.
>
> The rationale behind volatile anon pages was that they are often used
> as caches and that dropping them under pressure and regenerating the
> cache contents later on was much faster than swapping.
>
> But pages that are backed by an actual filesystem are "regenerated" by
> reading the contents back from disk! What's the point of declaring
> them volatile?
>
> Shmem pages are a different story. They might be implemented by a
> virtual filesystem, but they behave like anon pages when it comes to
> reclaim and repopulation so the same rationale for volatility appies.
Right. So file volatility is mostly interesting to me on tmpfs/shmem,
and your point about them being only sort of technically file pages is
true, it sort of depends on where you stand in the kernel as to if its
considered file or anonymous memory.

As for real-disk-backed file volatility, I'm not particularly interested
in that, and fine with losing it. However, some have expressed
theoretical interest that there may be cases where throwing the memory
away is faster then writing it back to disk, so it might have some value
there. But I don't have any concrete use cases that need it.

Really the *key* need for tmpfs/shmem file volatility is in order to
have volatility on shared memory.

> But a big aspect of anon volatility is communicating to userspace
> whether *content* has been destroyed while in volatile state. Shmem
> pages might not necessarily need this. The oft-cited example is the
> message passing in a large circular buffer that is unused most of the
> time. The sender would mark it non-volatile before writing, and the
> receiver would mark it volatile again after reading. The writer can
> later reuse any unreclaimed *memory*, but nobody is coming back for
> the actual *contents* stored in there. This usecase would be
> perfectly fine with an interface that simply clears the dirty bits of
> a range of shmem pages (through mmap or fd). The writer would set the
> pages non-volatile by dirtying them, whereas the reader would mark
> them volatile again by clearing the dirty bits. Reclaim would simply
> discard clean pages.

So while in that case, its unlikely anyone is going to be trying to
reuse contents of the volatile data. However, there may be other cases
where for example, a image cache is managed in a shared tmpfs segment
between a jpeg renderer and a web-browser, so there can be improved
isolation/sandboxing between the functionality. In that case, the
management of the volatility would be handled completely by the
web-browser process, but we'd want memory to be shared so we've have
zero-copy from the render.

In that case, the browser would want to be able mark chunks of shared
buffer volatile when it wasn't in use, and to unmark the range and
re-use it if it wasn't purged.

But maybe I'm not quite seeing what your suggesting here!


> So I'm not convinced that the anon side needs to be that awkward, that
> all filesystems need to be supported because of shmem, and that shmem
> needs more than an interface to clear dirty bits.

Minchan may also have a different opinion then me, but I think I can
compromise/agree with your first two points there (or atleast see about
giving your approach a shot). The last I'm not confident of, but please
expand if my examples above don't contradict your idea.

Johannes: Thanks so much for the review here! After not getting too much
feedback, its been hard to put additional effort into this work, despite
feeling that it is important. This is really motivating!

thanks
-john

2014-01-29 05:09:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

Hi Hannes,

It's interesting timing, I posted this patch Yew Year's Day
and receives indepth design review Lunar New Year's Day. :)
It's almost 0-day review. :)

On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> Hello Minchan,
>
> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > Hey all,
> >
> > Happy New Year!
> >
> > I know it's bad timing to send this unfamiliar large patchset for
> > review but hope there are some guys with freshed-brain in new year
> > all over the world. :)
> > And most important thing is that before I dive into lots of testing,
> > I'd like to make an agreement on design issues and others
> >
> > o Syscall interface
>
> Why do we need another syscall for this? Can't we extend madvise to

Yeb. I should have written the reason. Early versions in this patchset
had used madvise with VMA handling but it was terrible performance for
ebizzy workload by mmap_sem's downside lock due to merging/split VMA.
Even it was worse than old so I gave up the VMA approach.

You could see the difference.
https://lkml.org/lkml/2013/10/8/63

It might be not a good decision and someone might say that if mmap_sem
is really headache, please fix it at first but as you know well,
it's never simple problem.
I hope if better idea or final decision comes in(ex, let's hold
until someone fix mmap_sem scalability), I could follow that.

> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> in the range was purged?

In that case, -ENOMEM would have duplicated meaning "Purged" and "Out
of memory so failed in the middle of the system call processing" and
later could be a problem so we need to return value to indicate
how many bytes are succeeded so far so it means we need additional
out parameter. But yes, we can solve it by modifying semantic and
behavior (ex, as you said below, we could just unmark volatile
successfully if user pass (offset, len) consistent with marked volatile
ranges. (IOW, if we give up overlapping/subrange marking/unmakring
usecase. I expect it makes code simple further).
It's request from John so If he is okay, I'm no problem.

And there was another reason to make hard reusing madvise.

full puring VS partial purging

If someone should regenerate full object without
considering fault/alloc/zeroig(and many of people wanted it rather than
partial) he would want to VOLATILE_RANGE_FULL purging while
others(acutlly, only I) likes VOLATILE_RANGE_PARTIAL for
allocators(ie, vrange-anon) because it's fair for every processes
in allocator POV(fault + alloc + zeroing overhead).

It's not implemented yet in this patchset but I thought it is worth
discuss and if we want it, current madvise isn't enough.

>
> > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> > footprint.
>
> VMAs are there to track attributes of memory ranges. Duplicating
> large parts of their functionality and co-maintaining both structures
> on create, destroy, split, and merge means duplicate code and complex
> interactions.
>
> 1. You need to define semantics and coordinate what happens when the
> vma underlying a volatile range changes.
>
> Either you have to strictly co-maintain both range objects, or you
> have weird behavior like volatily outliving a vma and then applying
> to a separate vma created in its place.
>
> Userspace won't get this right, and even in the kernel this is
> error prone and adds a lot to the complexity of vma management.

Current semantic is following as,
Vma handling logic in mm doesn't need to know vrange handling because
vrange's internal logic always checks validity of the vma but
one thing to do in vma logic is only clearing old volatile ranges
on creating new vma.
(Look at [PATCH v10 02/16] vrange: Clear volatility on new mmaps)
Acutally I don't like the idea and suggested following as.
https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working&id=821f58333b381fd88ee7f37fd9c472949756c74e
But John didn't like it. I guess if VMA size is really matter,
maybe we can embedded the flag into somewhere field of
vma(ex, vm_file LSB?)

Anyway, what I want to say is that vma/vrange co-maintaining
seem to be not bad.

>
> 2. If page reclaim discards a page from the upper end of a a range,
> you mark the whole range as purged. If the user later marks the
> lower half of the range as non-volatile, the syscall will report
> purged=1 even though all requested pages are still there.

True, The assumption is that basically, user should have a range
per object but we gives flexibility for user to handle subranges
of a volatile range so it might report false positive as you said.
In that case, please user can use mincore(2) for accuracy if he
want so he has flexiblity but lose performance a bit.
It's a tradeoff, IMO.

>
> The only way to make these semantics clean is either
>
> a) have vrange() return a range ID so that only full ranges can
> later be marked non-volatile, or

>
> b) remember individual page purges so that sub-range changes can
> properly report them
>
> I don't like a) much because it's somewhat arbitrarily more
> restrictive than madvise, mprotect, mmap/munmap etc. And for b),
> the straight-forward solution would be to put purge-cookies into
> the page tables to properly report purges in subrange changes, but
> that would be even more coordination between vmas, page tables, and
> the ad-hoc vranges.

Agree but I don't want to put a accuracy of defalut vrange syscall.
Page table lookup needs mmap_sem and O(N) cost so I'm afraid it would
make userland folks hesitant using this system call.
That's why I'd like to make accuarcy with mincore which is alreay costly
function.

>
> 3. Page reclaim usually happens on individual pages until an
> allocation can be satisfied, but the shrinker purges entire ranges.

Strictly speaking, not entire rangeS but entire A range.
This recent patchset bails out if we discard as much as VM want.

>
> Should it really take out an entire 1G volatile range even though 4
> pages would have been enough to satisfy an allocation? Sure, we
> assume a range represents an single "object" and userspace would
> have to regenerate the whole thing with only one page missing, but
> there is still a massive difference in page frees, faults, and
> allocations.

That's why I wanted to introduce full and partial purging flag as system
call argument.

>
> There needs to be a *really* good argument why VMAs are not enough for
> this purpose. I would really like to see anon volatility implemented

Strictly speaking, volatile ranges has two goals.

1. Avoid unncessary swapping or OOM if system has lots of volatile memory
2. Give advanced free rather than madvise(DONTNEED)

First goal is very clear so I don't need to say again
but it seems second goal isn't clear so that I try elaborate it a bit.

Current allocators really hates frequent calling munmap which is big
performance overhead if other threads are allocating new address
space or are faulting existing address space so they have used
madvise(DONTNEED) as optimization so at least, faulting threads
works well in parallel. It's better but allocators couldn't use
madvise(DONTNEED) frequently because it still prevent other thread's
allocation of new address space for a long time(becuase the overhead
of the system call is O(vma_size(vma)).

Volatile ranges system call never don't need to hold write-side mmap_sem
and the execution time is almost O(log(nr_range))) and if we follow your
suggestion(ie, vrange returns ID), it's O(1). It's better.

Another concern is that some of people want to handle range
fine-granularity, maybe worst case, PAGE_SIZE, in that case
so many VMA could be created if purging happens sparsely so it would
be really memory concern.

> as a VMA attribute, and have regular reclaim decide based on rmap of
> individual pages whether it needs to swap or purge. Something like
> this:
>
> MADV_VOLATILE:
> split vma if necessary
> set VM_VOLATILE
>
> MADV_NONVOLATILE:
> clear VM_VOLATILE
> merge vma if possible
> pte walk to check for pmd_purged()/pte_purged()
> return any_purged

It could make system call really slow so allocator people really
would be reluctant to use it.

>
> shrink_page_list():
> if PageAnon:
> if try_to_purge_anon():
> page_lock_anon_vma_read()
> anon_vma_interval_tree_foreach:
> if vma->vm_flags & VM_VOLATILE:
> lock page table
> unmap page
> set_pmd_purged() / set_pte_purged()
> unlock page table
> page_lock_anon_vma_read()
> ...
> try to reclaim
>
> > o Purging logic - when we trigger purging volatile pages to prevent
> > working set and stop to prevent too excessive purging of volatile
> > pages
> > o How to test
> > Currently, we have a patched jemalloc allocator by Jason's help
> > although it's not perfect and more rooms to be enhanced but IMO,
> > it's enough to prove vrange-anonymous. The problem is that
> > lack of benchmark for testing vrange-file side. I hope that
> > Mozilla folks can help.
> >
> > So its been a while since the last release of the volatile ranges
> > patches, again. I and John have been busy with other things.
> > Still, we have been slowly chipping away at issues and differences
> > trying to get a patchset that we both agree on.
> >
> > There's still a few issues, but we figured any further polishing of
> > the patch series in private would be unproductive and it would be much
> > better to send the patches out for review and comment and get some wider
> > opinions.
> >
> > You could get full patchset by git
> >
> > git clone -b vrange-v10-rc5 --single-branch git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
> >
> > In v10, there are some notable changes following as
> >
> > Whats new in v10:
> > * Fix several bugs and build break
> > * Add shmem_purge_page to correct purging shmem/tmpfs
> > * Replace slab shrinker with direct hooked reclaim path
> > * Optimize pte scanning by caching previous place
> > * Reorder patch and tidy up Cc-list
> > * Rebased on v3.12
> > * Add vrange-anon test with jemalloc in Dhaval's test suite
> > - https://github.com/volatile-ranges-test/vranges-test
> > so, you could test any application with vrange-patched jemalloc by
> > LD_PRELOAD but please keep in mind that it's just a prototype to
> > prove vrange syscall concept so it has more rooms to optimize.
> > So, please do not compare it with another allocator.
> >
> > Whats new in v9:
> > * Updated to v3.11
> > * Added vrange purging logic to purge anonymous pages on
> > swapless systems
>
> We stopped scanning anon on swapless systems because anon needed swap
> to be reclaimable. If we can reclaim anon without swap, we have to
> start scanning anon again unconditionally. It makes no sense to me to
> work around this optimization and implement a separate reclaim logic.

If anyone don't object it, I really want it because it would make lots of
thing simple and acutally I implented it early version but the reason
I changed my mind is I don't want to put the unnecessary overhead
on the system which don't have any volatile pages on swapless.

Again said, if anyone don't object it, I want to enable forceful
anon page scanning on swapless system.

>
> > The syscall interface is defined in patch [4/16] in this series, but
> > briefly there are two ways to utilze the functionality:
> >
> > Explicit marking method:
> > 1) Userland marks a range of memory that can be regenerated if necessary
> > as volatile
> > 2) Before accessing the memory again, userland marks the memroy as
> > nonvolatile, and the kernel will provide notifcation if any pages in the
> > range has been purged.
> >
> > Optimistic method:
> > 1) Userland marks a large range of data as volatile
> > 2) Userland continues to access the data as it needs.
> > 3) If userland accesses a page that has been purged, the kernel will
> > send a SIGBUS
> > 4) Userspace can trap the SIGBUS, mark the afected pages as
> > non-volatile, and refill the data as needed before continuing on
>
> What happens if a pointer to volatile memory is passed to a syscall
> and the fault happens inside copy_*_user()?

If it was point purged page, page fault handler should send a SIGBUS.
Otherwise, it will allocate a page and no problem.
But there is more than what I imagined.
Do you have any curprit to be a problem? I will be happy to check it.

Thanks for the poiting out.

>
> > Other details:
> > The interface takes a range of memory, which can cover anonymous pages
> > as well as mmapped file pages. In the case that the pages are from a
> > shared mmapped file, the volatility set on those file pages is global.
> > Thus much as writes to those pages are shared to other processes, pages
> > marked volatile will be volatile to any other processes that have the
> > file mapped as well. It is advised that processes coordinate when using
> > volatile ranges on shared mappings (much as they must coordinate when
> > writing to shared data). Any uncleared volatility on mmapped files will
> > last until the the file is closed by all users (ie: volatility isn't
> > persistent on disk).
>
> Support for file pages are a very big deal and they seem to have had
> an impact on many design decisions, but they are only mentioned on a
> side note in this email.
>
> The rationale behind volatile anon pages was that they are often used
> as caches and that dropping them under pressure and regenerating the
> cache contents later on was much faster than swapping.

Additionally, making hinting system call very fast without negative effect
so we avoid unnecessary fault/alloc/page-zeroing.

Everything you said are issues I should mention in LSF/MM and discuss
with community and surely my description should have.

Thanks for the review and raising the issue, Hannes!

>
> But pages that are backed by an actual filesystem are "regenerated" by
> reading the contents back from disk! What's the point of declaring
> them volatile?
>
> Shmem pages are a different story. They might be implemented by a
> virtual filesystem, but they behave like anon pages when it comes to
> reclaim and repopulation so the same rationale for volatility appies.
>
> But a big aspect of anon volatility is communicating to userspace
> whether *content* has been destroyed while in volatile state. Shmem
> pages might not necessarily need this. The oft-cited example is the
> message passing in a large circular buffer that is unused most of the
> time. The sender would mark it non-volatile before writing, and the
> receiver would mark it volatile again after reading. The writer can
> later reuse any unreclaimed *memory*, but nobody is coming back for
> the actual *contents* stored in there. This usecase would be
> perfectly fine with an interface that simply clears the dirty bits of
> a range of shmem pages (through mmap or fd). The writer would set the
> pages non-volatile by dirtying them, whereas the reader would mark
> them volatile again by clearing the dirty bits. Reclaim would simply
> discard clean pages.
>
> So I'm not convinced that the anon side needs to be that awkward, that
> all filesystems need to be supported because of shmem, and that shmem
> needs more than an interface to clear dirty bits.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2014-01-29 18:30:59

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
> > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> >> o Syscall interface
> > Why do we need another syscall for this? Can't we extend madvise to
> > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> > in the range was purged?
>
> So the madvise interface is insufficient to provide the semantics
> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
> NONVOLATILE call, we have to atomically unmark the volatility status of
> the byte range and provide the purge status, which informs the caller if
> any of the data in the specified range was discarded (and thus needs to
> be regenerated).
>
> The problem is that by clearing the range, we may need to allocate
> memory (possibly by splitting in an existing range segment into two),
> which possibly could fail. Unfortunately this could happen after we've
> modified the volatile state of part of that range. At this point we
> can't just fail, because we've modified state and we also need to return
> the purge status of the modified state.

munmap() can theoretically fail for the same reason (splitting has to
allocate a new vma) but it's not even documented. The allocator does
not fail allocations of that order.

I'm not sure this is good enough, but to me it sounds a bit overkill
to design a new system call around a non-existent problem.

> >> o Not bind with vma split/merge logic to prevent mmap_sem cost and
> >> o Not bind with vma split/merge logic to avoid vm_area_struct memory
> >> footprint.
> > VMAs are there to track attributes of memory ranges. Duplicating
> > large parts of their functionality and co-maintaining both structures
> > on create, destroy, split, and merge means duplicate code and complex
> > interactions.
> >
> > 1. You need to define semantics and coordinate what happens when the
> > vma underlying a volatile range changes.
> >
> > Either you have to strictly co-maintain both range objects, or you
> > have weird behavior like volatily outliving a vma and then applying
> > to a separate vma created in its place.
>
> So indeed this is a difficult problem! My initial approach is simply
> when any new mapping is made, we clear the volatility of the affected
> process memory. Admittedly this has extra overhead and Minchan has an
> alternative here (which I'm not totally sold on yet, but may be ok).
> I'm almost convinced that for anonymous volatility, storing the
> volatility in the vma would be ok, but Minchan is worried about the
> performance overhead of the required locking for manipulating the vmas.
>
> For file volatility, this is more complicated, because since the
> volatility is shared, the ranges have to be tracked against the
> address_space structure, and can't be stored in per-process vmas. So
> this is partially why we've kept range trees hanging off of the mm and
> address_spaces structures, since it allows the range manipulation logic
> to be shared in both cases.

The fs people probably have not noticed yet what you've done to struct
address_space / struct inode ;-) I doubt that this is mergeable in its
current form, so we have to think about a separate mechanism for shmem
page ranges either way.

> > Userspace won't get this right, and even in the kernel this is
> > error prone and adds a lot to the complexity of vma management.
> Not sure exactly I understand what you mean by "userspace won't get this
> right" ?

I meant, userspace being responsible for keeping vranges coherent with
its mmap and munmap operations, instead of the kernel doing it.

> > 2. If page reclaim discards a page from the upper end of a a range,
> > you mark the whole range as purged. If the user later marks the
> > lower half of the range as non-volatile, the syscall will report
> > purged=1 even though all requested pages are still there.
>
> To me this aspect is a non-ideal but acceptable result of the usage pattern.
>
> Semantically, the hard rule would be we never report non-purged if pages
> in a range were purged. Reporting purged when pages technically weren't
> is not optimal but acceptable side effect of unmarking a sub-range. And
> could be avoided by applications marking and unmarking objects consistently.
>
>
> > The only way to make these semantics clean is either
> >
> > a) have vrange() return a range ID so that only full ranges can
> > later be marked non-volatile, or
> >
> > b) remember individual page purges so that sub-range changes can
> > properly report them
> >
> > I don't like a) much because it's somewhat arbitrarily more
> > restrictive than madvise, mprotect, mmap/munmap etc.
> Agreed on A.
>
> > And for b),
> > the straight-forward solution would be to put purge-cookies into
> > the page tables to properly report purges in subrange changes, but
> > that would be even more coordination between vmas, page tables, and
> > the ad-hoc vranges.
>
> And for B this would cause way too much overhead for the mark/unmark
> operations, which have to be lightweight.

Yes, and allocators/message passers truly don't need this because at
the time they set a region to volatile the contents are invalidated
and the non-volatile declaration doesn't give a hoot if content has
been destroyed.

But caches certainly would have to know if they should regenerate the
contents. And bigger areas should be using huge pages, so we'd check
in 2MB steps. Is this really more expensive than regenerating the
contents on a false positive?

MADV_NONVOLATILE and MADV_NONVOLATILE_REPORT? (catchy, I know...)

What worries me a bit is that we have started from the baseline that
anything that scales with range size is way too much overhead,
regardless of how awkward and alien the alternatives are to implement.
But even in its most direct implementation, marking discardable pages
one by one is still a massive improvement over thrashing cache or
swapping, so why do we have to start from such an extreme?

Applications won't use this interface because it's O(1), but because
they don't want to be #1 in memory consumption when the system hangs
and thrashes and swaps.

Obviously, the lighter the better, but this code just doesn't seem to
integrate at all into the VM and I don't think it's justified.

> > 3. Page reclaim usually happens on individual pages until an
> > allocation can be satisfied, but the shrinker purges entire ranges.
> >
> > Should it really take out an entire 1G volatile range even though 4
> > pages would have been enough to satisfy an allocation? Sure, we
> > assume a range represents an single "object" and userspace would
> > have to regenerate the whole thing with only one page missing, but
> > there is still a massive difference in page frees, faults, and
> > allocations.
>
> So the purging behavior has been a (lightly) contentious item. Some have
> argued that if a page from a range is purged, we might as well purge the
> entire range before purging any page from another range. This makes the
> most sense from the usage model where a range is marked and not touched
> until it is unmarked.

If the pages in a range were faulted in together, their LRU order will
reflect this.

> However, if the user is utilizing the SIGBUS behavior and continues to
> access the volatile range, we would really ideally have only the cold
> pages purged so that the application can continue and the system can
> manage all the pages via the LRU.
>
> Minchan has proposed having flags to set the volatility mode (_PARTIAL
> or _FULL) to allow applications to state their preferred behavior, but I
> still favor having global LRU behavior and purging things page by page
> via normal reclaim. My opinion is due to the fact that full-range
> purging causes purging to be least-recently-marked-volatile rather then
> LRU, but if the ranges are not accessed while volatile, LRU should
> approximate the least-recently-marked-volatile.

I'm with you on this. We could always isolate anon pages in vrange
clusters around the target LRU page, but the primary means of aging
and reclaiming anon memory should remain the LRU list scanner.

> However there are more then just philosophical issues that complicate
> things. On swapless systems, we don't age anonymous pages, so we don't
> have a hook to purge volatile pages. So in that case we currently have
> to use the shrinker and use the full-range purging behavior.
>
> Adding anonymous aging for swapless systems might be able to help here,
> but thats likely to be complicated as well. For now the dual approach
> Minchan implemented (where the LRU can evict single pages, and the
> shrinker can evict total ranges) seems like a reasonable compromise
> while the functionality is reviewed.

We should at least *try* to go back to aging anon and run some tests
to quantify the costs, before creating a new VM object with its own
separate ad-hoc aging on speculation! We've been there and it wasn't
that bad...

> > There needs to be a *really* good argument why VMAs are not enough for
> > this purpose. I would really like to see anon volatility implemented
> > as a VMA attribute, and have regular reclaim decide based on rmap of
> > individual pages whether it needs to swap or purge. Something like
> > this:
>
>
> The performance aspect is the major issue. With the separate range tree,
> the operations are close to O(log(#ranges)), which is really attractive.
> Anything that changes it to O(#pages in range) would be problematic.
> But I think I could mostly go along with this if it stays O(log(#vmas)),
> but will let Minchan detail his objections (which are mostly around the
> locking contention). Though a few notes on your proposal...
>
>
> >
> > MADV_VOLATILE:
> > split vma if necessary
> > set VM_VOLATILE
> >
> > MADV_NONVOLATILE:
> also need a "split vma if necessary" here.

True.

> > clear VM_VOLATILE
> > merge vma if possible
> > pte walk to check for pmd_purged()/pte_purged()
> So I think the pte walk would be way too costly. Its probably easier to
> have a VM_PURGED or something on the vma that we set when we purge a
> page, which would simplify the purge state handling.

I added it because, right now, userspace only knows about pages. It
does not know that when you mmap/madvise/mprotect that the kernel
creates range objects - that's an implementation detail - it only
knows that you are changing the attributes of a bunch of pages.

Reporting a result for a super-range of what you actually speficied in
the syscall would implicitely turn ranges into first class
user-visible VM objects.

This is a huge precedent. "The traditional Unix memory objects of
file mappings, shared memory segments, anon mappings, pages, bytes,
and vranges!" Is it really that strong of an object type? Does a
performance optimization of a single use case justify it?

I'd rather we make the reporting optional and report nothing in cases
where it's not needed, if that's all it takes.

And try very hard to stick with pages as the primary unit of paging.

> > return any_purged
> >
> > shrink_page_list():
> > if PageAnon:
> > if try_to_purge_anon():
> > page_lock_anon_vma_read()
> > anon_vma_interval_tree_foreach:
> > if vma->vm_flags & VM_VOLATILE:
> > lock page table
> > unmap page
> > set_pmd_purged() / set_pte_purged()
> > unlock page table
> > page_lock_anon_vma_read()
> > ...
> > try to reclaim
>
> Again, we'd have to sort out something here for swapless systems.

Do you mean aside from making sure anon is aged?

> >> Whats new in v9:
> >> * Updated to v3.11
> >> * Added vrange purging logic to purge anonymous pages on
> >> swapless systems
> > We stopped scanning anon on swapless systems because anon needed swap
> > to be reclaimable. If we can reclaim anon without swap, we have to
> > start scanning anon again unconditionally. It makes no sense to me to
> > work around this optimization and implement a separate reclaim logic.
>
> I'd personally prefer we move to that.. but I'm not sure if the
> unconditional scanning would be considered too problematic?

Quantifying the costs would be good, so that we know whether the
complexity of an out-of-band reclaim mechanism for anon pages is
justifiable. I doubt it, tbh.

> >> Optimistic method:
> >> 1) Userland marks a large range of data as volatile
> >> 2) Userland continues to access the data as it needs.
> >> 3) If userland accesses a page that has been purged, the kernel will
> >> send a SIGBUS
> >> 4) Userspace can trap the SIGBUS, mark the afected pages as
> >> non-volatile, and refill the data as needed before continuing on
> > What happens if a pointer to volatile memory is passed to a syscall
> > and the fault happens inside copy_*_user()?
>
> I'll have to look into that detail. Thanks for bringing it up.
>
> I suspect it would be the same as if a pointer to mmapped file was
> passed to a syscall and the file was truncated by another processes?

I think so. But doing that is kind of questionable to begin with,
whereas passing volatile pointers around would be a common and valid
thing to do.

> > Support for file pages are a very big deal and they seem to have had
> > an impact on many design decisions, but they are only mentioned on a
> > side note in this email.
> >
> > The rationale behind volatile anon pages was that they are often used
> > as caches and that dropping them under pressure and regenerating the
> > cache contents later on was much faster than swapping.
> >
> > But pages that are backed by an actual filesystem are "regenerated" by
> > reading the contents back from disk! What's the point of declaring
> > them volatile?
> >
> > Shmem pages are a different story. They might be implemented by a
> > virtual filesystem, but they behave like anon pages when it comes to
> > reclaim and repopulation so the same rationale for volatility appies.
> Right. So file volatility is mostly interesting to me on tmpfs/shmem,
> and your point about them being only sort of technically file pages is
> true, it sort of depends on where you stand in the kernel as to if its
> considered file or anonymous memory.

Yeah, there are many angles to look at it, but it's important that the
behavior is sane and consistent across all of them.

> As for real-disk-backed file volatility, I'm not particularly interested
> in that, and fine with losing it. However, some have expressed
> theoretical interest that there may be cases where throwing the memory
> away is faster then writing it back to disk, so it might have some value
> there. But I don't have any concrete use cases that need it.

That could also be covered with an interface that clears dirty bits in
a range of pages.

> Really the *key* need for tmpfs/shmem file volatility is in order to
> have volatility on shared memory.
>
> > But a big aspect of anon volatility is communicating to userspace
> > whether *content* has been destroyed while in volatile state. Shmem
> > pages might not necessarily need this. The oft-cited example is the
> > message passing in a large circular buffer that is unused most of the
> > time. The sender would mark it non-volatile before writing, and the
> > receiver would mark it volatile again after reading. The writer can
> > later reuse any unreclaimed *memory*, but nobody is coming back for
> > the actual *contents* stored in there. This usecase would be
> > perfectly fine with an interface that simply clears the dirty bits of
> > a range of shmem pages (through mmap or fd). The writer would set the
> > pages non-volatile by dirtying them, whereas the reader would mark
> > them volatile again by clearing the dirty bits. Reclaim would simply
> > discard clean pages.
>
> So while in that case, its unlikely anyone is going to be trying to
> reuse contents of the volatile data. However, there may be other cases
> where for example, a image cache is managed in a shared tmpfs segment
> between a jpeg renderer and a web-browser, so there can be improved
> isolation/sandboxing between the functionality. In that case, the
> management of the volatility would be handled completely by the
> web-browser process, but we'd want memory to be shared so we've have
> zero-copy from the render.
>
> In that case, the browser would want to be able mark chunks of shared
> buffer volatile when it wasn't in use, and to unmark the range and
> re-use it if it wasn't purged.

You are right, we probably can not ignore such cases.

The way I see it, the simplest design, the common denominator for
private anon, true file, shmem, tmpfs, would be for MADV/FADV_VOLATILE
to clear dirty bits off shared pages, or ptes/pmds in the private
mapped case to keep the COW charade intact. And for the NONVOLATILE
side to set dirty on what's still present and report if something is
missing.

Allocators and message passers don't care about content once volatile,
only about the memory. They wouldn't even have to go through the
non-volatile step anymore, they could just write to memory again and
it'll set the dirty bits and refault what's missing.

Caches in anon memory would have to mark the pages volatile and then
later non-volatile to see if the contents have been preserved.

In the standard configuration, this would exclude the optimistic
usecase, but it's conceivable to have a settable per-VMA flag that
prevents purged pages from refaulting silently and make them trip a
SIGBUS instead. But I'm still a little dubious whether this usecase
is workable in general...

Such an interface would be dead simple to use and consistent across
all types. The basic implementation would require only a couple of
lines of code, and while O(pages), it would still be much cheaper than
thrashing and swapping, and still cheaper than actively giving ranges
back to the kernel and reallocating and repopulating them later on.

Compare this to the diffstat of the current vrange implementation and
the complexity and inconsistencies it introduces into the VM. I'm not
sure an O(pages) interface would be unattractive enough to justify it.

Johannes

2014-01-31 01:27:34

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On 01/29/2014 10:30 AM, Johannes Weiner wrote:
> On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
>> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
>>> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
>>>> o Syscall interface
>>> Why do we need another syscall for this? Can't we extend madvise to
>>> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
>>> in the range was purged?
>> So the madvise interface is insufficient to provide the semantics
>> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
>> NONVOLATILE call, we have to atomically unmark the volatility status of
>> the byte range and provide the purge status, which informs the caller if
>> any of the data in the specified range was discarded (and thus needs to
>> be regenerated).
>>
>> The problem is that by clearing the range, we may need to allocate
>> memory (possibly by splitting in an existing range segment into two),
>> which possibly could fail. Unfortunately this could happen after we've
>> modified the volatile state of part of that range. At this point we
>> can't just fail, because we've modified state and we also need to return
>> the purge status of the modified state.
> munmap() can theoretically fail for the same reason (splitting has to
> allocate a new vma) but it's not even documented. The allocator does
> not fail allocations of that order.
>
> I'm not sure this is good enough, but to me it sounds a bit overkill
> to design a new system call around a non-existent problem.

I still think its problematic design issue. With munmap, I think
re-calling on failure should be fine. But with _NONVOLATILE we could
possibly lose the purge status on a second call (for instance if only
the first page of memory was purged, but we errored out mid-call w/
ENOMEM, on the second call it will seem like the range was successfully
set non-volatile with no memory purged).

And even if the current allocator never ever fails, I worry at some
point in the future that rule might change and then we'd have a broken
interface.



>>>> o Not bind with vma split/merge logic to prevent mmap_sem cost and
>>>> o Not bind with vma split/merge logic to avoid vm_area_struct memory
>>>> footprint.
>>> VMAs are there to track attributes of memory ranges. Duplicating
>>> large parts of their functionality and co-maintaining both structures
>>> on create, destroy, split, and merge means duplicate code and complex
>>> interactions.
>>>
>>> 1. You need to define semantics and coordinate what happens when the
>>> vma underlying a volatile range changes.
>>>
>>> Either you have to strictly co-maintain both range objects, or you
>>> have weird behavior like volatily outliving a vma and then applying
>>> to a separate vma created in its place.
>> So indeed this is a difficult problem! My initial approach is simply
>> when any new mapping is made, we clear the volatility of the affected
>> process memory. Admittedly this has extra overhead and Minchan has an
>> alternative here (which I'm not totally sold on yet, but may be ok).
>> I'm almost convinced that for anonymous volatility, storing the
>> volatility in the vma would be ok, but Minchan is worried about the
>> performance overhead of the required locking for manipulating the vmas.
>>
>> For file volatility, this is more complicated, because since the
>> volatility is shared, the ranges have to be tracked against the
>> address_space structure, and can't be stored in per-process vmas. So
>> this is partially why we've kept range trees hanging off of the mm and
>> address_spaces structures, since it allows the range manipulation logic
>> to be shared in both cases.
> The fs people probably have not noticed yet what you've done to struct
> address_space / struct inode ;-) I doubt that this is mergeable in its
> current form, so we have to think about a separate mechanism for shmem
> page ranges either way.

Yea. But given the semantics will likely be *very* similar, it seems
strange to try to force separate mechanisms.

That said, in an earlier implementation I stored the range tree in a
hash so we wouldn't have to add anything to the address_space structure.
But for now I want to make it clear that the ranges are tied to the
address space (and it gives the fs folks something to notice ;).


>>> Userspace won't get this right, and even in the kernel this is
>>> error prone and adds a lot to the complexity of vma management.
>> Not sure exactly I understand what you mean by "userspace won't get this
>> right" ?
> I meant, userspace being responsible for keeping vranges coherent with
> its mmap and munmap operations, instead of the kernel doing it.
>
>>> 2. If page reclaim discards a page from the upper end of a a range,
>>> you mark the whole range as purged. If the user later marks the
>>> lower half of the range as non-volatile, the syscall will report
>>> purged=1 even though all requested pages are still there.
>> To me this aspect is a non-ideal but acceptable result of the usage pattern.
>>
>> Semantically, the hard rule would be we never report non-purged if pages
>> in a range were purged. Reporting purged when pages technically weren't
>> is not optimal but acceptable side effect of unmarking a sub-range. And
>> could be avoided by applications marking and unmarking objects consistently.
>>
>>
>>> The only way to make these semantics clean is either
>>>
>>> a) have vrange() return a range ID so that only full ranges can
>>> later be marked non-volatile, or
>>>
>>> b) remember individual page purges so that sub-range changes can
>>> properly report them
>>>
>>> I don't like a) much because it's somewhat arbitrarily more
>>> restrictive than madvise, mprotect, mmap/munmap etc.
>> Agreed on A.
>>
>>> And for b),
>>> the straight-forward solution would be to put purge-cookies into
>>> the page tables to properly report purges in subrange changes, but
>>> that would be even more coordination between vmas, page tables, and
>>> the ad-hoc vranges.
>> And for B this would cause way too much overhead for the mark/unmark
>> operations, which have to be lightweight.
> Yes, and allocators/message passers truly don't need this because at
> the time they set a region to volatile the contents are invalidated
> and the non-volatile declaration doesn't give a hoot if content has
> been destroyed.
>
> But caches certainly would have to know if they should regenerate the
> contents. And bigger areas should be using huge pages, so we'd check
> in 2MB steps. Is this really more expensive than regenerating the
> contents on a false positive?

So you make a good argument. I'd counter that the false-positives are
only caused when unmarking subranges of larger marked volatile range,
and for use cases that would care about regenerating the contents,
that's not a likely useage model (as they're probably going to be
marking objects in memory volatile/nonvolatile, not just arbitrary
ranges of pages).


> MADV_NONVOLATILE and MADV_NONVOLATILE_REPORT? (catchy, I know...)

Something like this might be doable. I suspect the non-reporting
non-volatile is more of a special case (reporting should probably be the
default - as its safer), so it should probably have the longer name.
But that's a minor issue.


> What worries me a bit is that we have started from the baseline that
> anything that scales with range size is way too much overhead,
> regardless of how awkward and alien the alternatives are to implement.
> But even in its most direct implementation, marking discardable pages
> one by one is still a massive improvement over thrashing cache or
> swapping, so why do we have to start from such an extreme?
>
> Applications won't use this interface because it's O(1), but because
> they don't want to be #1 in memory consumption when the system hangs
> and thrashes and swaps.
>
> Obviously, the lighter the better, but this code just doesn't seem to
> integrate at all into the VM and I don't think it's justified.

You're point about the implementation being somewhat alien is understood
(though again, there's not really a VMA like structure for files, so
tmpfs volatility I think will need something like this anyway) and at
this point I'm willing to give this a try. A fast design that gets
ignored is of less use then a slower one that gets reviewed and can be
merged. :)

That said, the whole premise here isn't in any single applications best
interest. We're basically asking applications to pledge donations of
memory, which will can only hurt their performance if the kernel take
it. This allows the entire system to run better, as more applications
can stay in memory, but if an application can get a 15% performance bump
by not donating memory and hoping the OOM killer will get some
other-app, that probably is hard to argue against. So I think lowering
the bar as much as possible so the "donations" minimally affect
performance is important for adoption (for example, think of the issue
w/ fsync on ext3).

But again, maybe that performance issue is something folks would be
willing to look into at a later point once the functionality is merged
and well understood?

>> However there are more then just philosophical issues that complicate
>> things. On swapless systems, we don't age anonymous pages, so we don't
>> have a hook to purge volatile pages. So in that case we currently have
>> to use the shrinker and use the full-range purging behavior.
>>
>> Adding anonymous aging for swapless systems might be able to help here,
>> but thats likely to be complicated as well. For now the dual approach
>> Minchan implemented (where the LRU can evict single pages, and the
>> shrinker can evict total ranges) seems like a reasonable compromise
>> while the functionality is reviewed.
> We should at least *try* to go back to aging anon and run some tests
> to quantify the costs, before creating a new VM object with its own
> separate ad-hoc aging on speculation! We've been there and it wasn't
> that bad...

Fair enough!


>>> clear VM_VOLATILE
>>> merge vma if possible
>>> pte walk to check for pmd_purged()/pte_purged()
>> So I think the pte walk would be way too costly. Its probably easier to
>> have a VM_PURGED or something on the vma that we set when we purge a
>> page, which would simplify the purge state handling.
> I added it because, right now, userspace only knows about pages. It
> does not know that when you mmap/madvise/mprotect that the kernel
> creates range objects - that's an implementation detail - it only
> knows that you are changing the attributes of a bunch of pages.
>
> Reporting a result for a super-range of what you actually speficied in
> the syscall would implicitely turn ranges into first class
> user-visible VM objects.
>
> This is a huge precedent. "The traditional Unix memory objects of
> file mappings, shared memory segments, anon mappings, pages, bytes,
> and vranges!" Is it really that strong of an object type? Does a
> performance optimization of a single use case justify it?
>
> I'd rather we make the reporting optional and report nothing in cases
> where it's not needed, if that's all it takes.
>
> And try very hard to stick with pages as the primary unit of paging.

I'd still think that ranges are still bunches of pages. But that by
allowing for the reporting of false positives (which applications have
to be able to handle anyway) in the unlikely case of unmarking a
sub-range of pages that were marked, we are able to make the
marking/unmarking operation much faster. Yes, behaviorally one can
intuit from this that there are these page-range objects managed by the
kernel, but I'd think its not really a first-order object.

But you have a strong argument, and I'm willing to concede the point
(this is your space, afterall :). However I do worry that by limiting
the flexibility of the semantics, we'll lock out some potential
performance optimizations later.


>>> Support for file pages are a very big deal and they seem to have had
>>> an impact on many design decisions, but they are only mentioned on a
>>> side note in this email.
>>>
>>> The rationale behind volatile anon pages was that they are often used
>>> as caches and that dropping them under pressure and regenerating the
>>> cache contents later on was much faster than swapping.
>>>
>>> But pages that are backed by an actual filesystem are "regenerated" by
>>> reading the contents back from disk! What's the point of declaring
>>> them volatile?
>>>
>>> Shmem pages are a different story. They might be implemented by a
>>> virtual filesystem, but they behave like anon pages when it comes to
>>> reclaim and repopulation so the same rationale for volatility appies.
>> Right. So file volatility is mostly interesting to me on tmpfs/shmem,
>> and your point about them being only sort of technically file pages is
>> true, it sort of depends on where you stand in the kernel as to if its
>> considered file or anonymous memory.
> Yeah, there are many angles to look at it, but it's important that the
> behavior is sane and consistent across all of them.

Agreed.


>> As for real-disk-backed file volatility, I'm not particularly interested
>> in that, and fine with losing it. However, some have expressed
>> theoretical interest that there may be cases where throwing the memory
>> away is faster then writing it back to disk, so it might have some value
>> there. But I don't have any concrete use cases that need it.
> That could also be covered with an interface that clears dirty bits in
> a range of pages.

Well.. possibly. The issues with real files are ugly, since you don't
want have stale data show up after the purge. In the past I proposed the
hole punching for this, but that could be just as costly then writing
the data back.

Practically, I really don't see how true-file volatility makes any sense.

The only rational interest was made by Dave Chinner, but what he really
wanted was totally different. Something like a file-system persistent
(instead of in-memory) volatility, so that filesystems could pick chunks
of files to purge when disk space got tight.


>> Really the *key* need for tmpfs/shmem file volatility is in order to
>> have volatility on shared memory.
>>
>>> But a big aspect of anon volatility is communicating to userspace
>>> whether *content* has been destroyed while in volatile state. Shmem
>>> pages might not necessarily need this. The oft-cited example is the
>>> message passing in a large circular buffer that is unused most of the
>>> time. The sender would mark it non-volatile before writing, and the
>>> receiver would mark it volatile again after reading. The writer can
>>> later reuse any unreclaimed *memory*, but nobody is coming back for
>>> the actual *contents* stored in there. This usecase would be
>>> perfectly fine with an interface that simply clears the dirty bits of
>>> a range of shmem pages (through mmap or fd). The writer would set the
>>> pages non-volatile by dirtying them, whereas the reader would mark
>>> them volatile again by clearing the dirty bits. Reclaim would simply
>>> discard clean pages.
>> So while in that case, its unlikely anyone is going to be trying to
>> reuse contents of the volatile data. However, there may be other cases
>> where for example, a image cache is managed in a shared tmpfs segment
>> between a jpeg renderer and a web-browser, so there can be improved
>> isolation/sandboxing between the functionality. In that case, the
>> management of the volatility would be handled completely by the
>> web-browser process, but we'd want memory to be shared so we've have
>> zero-copy from the render.
>>
>> In that case, the browser would want to be able mark chunks of shared
>> buffer volatile when it wasn't in use, and to unmark the range and
>> re-use it if it wasn't purged.
> You are right, we probably can not ignore such cases.
>
> The way I see it, the simplest design, the common denominator for
> private anon, true file, shmem, tmpfs, would be for MADV/FADV_VOLATILE
> to clear dirty bits off shared pages, or ptes/pmds in the private
> mapped case to keep the COW charade intact. And for the NONVOLATILE
> side to set dirty on what's still present and report if something is
> missing.

Hrmmm. This sounds reasonable, but I'm not sure its right. It seems the
missing part here is (with anonymous scanning on swapless systems), we
still have to set the volatility of the page somewhere so the LRU
scanner will actually purge those made-clean pages, no? Otherwise won't
anon and tmpfs files either just be swapped or passed over and left in
memory? And again, for true-files, we don't want stale data.


> Allocators and message passers don't care about content once volatile,
> only about the memory. They wouldn't even have to go through the
> non-volatile step anymore, they could just write to memory again and
> it'll set the dirty bits and refault what's missing.

So this would implicitly make any write effectively clear the volatility
of a page? That probably would be an ok semantic with the use cases I'm
aware of, but its new.


> Caches in anon memory would have to mark the pages volatile and then
> later non-volatile to see if the contents have been preserved.
>
> In the standard configuration, this would exclude the optimistic
> usecase, but it's conceivable to have a settable per-VMA flag that
> prevents purged pages from refaulting silently and make them trip a
> SIGBUS instead. But I'm still a little dubious whether this usecase
> is workable in general...

So the SIGBUS bit is of particular interest to the Mozilla folks. My
understanding of the usage case they want to have is expanding
compressed library files into memory, but allowing the cold library
pages to be purged. Then on access, any purged pages triggers the
SIGBUS, which allows them to "fault" back in that page from the
compressed file.

Similarly I can see folks wanting to use the optimistic model for other
use cases in general, especially if we go with a O(#pages) algorithm, as
it allows even less overhead by avoiding umarking and remarking pages on
access.

That said, the SIGBUS model is a bit painful as properly handling
signals in a large application is treacherous. So alternative solutions
for what is bascially userland page-faulting would be of interest.



> Such an interface would be dead simple to use and consistent across
> all types. The basic implementation would require only a couple of
> lines of code, and while O(pages), it would still be much cheaper than
> thrashing and swapping, and still cheaper than actively giving ranges
> back to the kernel and reallocating and repopulating them later on.
>
> Compare this to the diffstat of the current vrange implementation and
> the complexity and inconsistencies it introduces into the VM. I'm not
> sure an O(pages) interface would be unattractive enough to justify it.

Ok. So I think we're in agreement with:
* Moving the volatility state for anonymous volatility into the VMA
* Getting anonymous scanning going again on swapless systems

I'm still not totally sure about, but willing to try
* Page granular volatile tracking

I'm still not convinced on:
* madvise as a sufficient interface


I'll try to work out a draft of what your proposing (probably just for
anonymous memory for now) and we can iterate from there?

thanks
-john

2014-01-31 06:16:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On Thu, Jan 30, 2014 at 05:27:18PM -0800, John Stultz wrote:
> On 01/29/2014 10:30 AM, Johannes Weiner wrote:
> > On Tue, Jan 28, 2014 at 05:43:54PM -0800, John Stultz wrote:
> >> On 01/28/2014 04:03 PM, Johannes Weiner wrote:
> >>> On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> >>>> o Syscall interface
> >>> Why do we need another syscall for this? Can't we extend madvise to
> >>> take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> >>> in the range was purged?
> >> So the madvise interface is insufficient to provide the semantics
> >> needed. Not so much for MADV_VOLATILE, but MADV_NONVOLATILE. For the
> >> NONVOLATILE call, we have to atomically unmark the volatility status of
> >> the byte range and provide the purge status, which informs the caller if
> >> any of the data in the specified range was discarded (and thus needs to
> >> be regenerated).
> >>
> >> The problem is that by clearing the range, we may need to allocate
> >> memory (possibly by splitting in an existing range segment into two),
> >> which possibly could fail. Unfortunately this could happen after we've
> >> modified the volatile state of part of that range. At this point we
> >> can't just fail, because we've modified state and we also need to return
> >> the purge status of the modified state.
> > munmap() can theoretically fail for the same reason (splitting has to
> > allocate a new vma) but it's not even documented. The allocator does
> > not fail allocations of that order.
> >
> > I'm not sure this is good enough, but to me it sounds a bit overkill
> > to design a new system call around a non-existent problem.
>
> I still think its problematic design issue. With munmap, I think
> re-calling on failure should be fine. But with _NONVOLATILE we could
> possibly lose the purge status on a second call (for instance if only
> the first page of memory was purged, but we errored out mid-call w/
> ENOMEM, on the second call it will seem like the range was successfully
> set non-volatile with no memory purged).
>
> And even if the current allocator never ever fails, I worry at some
> point in the future that rule might change and then we'd have a broken
> interface.

Fair enough, we don't have to paint ourselves into a corner.

> >>> 2. If page reclaim discards a page from the upper end of a a range,
> >>> you mark the whole range as purged. If the user later marks the
> >>> lower half of the range as non-volatile, the syscall will report
> >>> purged=1 even though all requested pages are still there.
> >> To me this aspect is a non-ideal but acceptable result of the usage pattern.
> >>
> >> Semantically, the hard rule would be we never report non-purged if pages
> >> in a range were purged. Reporting purged when pages technically weren't
> >> is not optimal but acceptable side effect of unmarking a sub-range. And
> >> could be avoided by applications marking and unmarking objects consistently.
> >>
> >>
> >>> The only way to make these semantics clean is either
> >>>
> >>> a) have vrange() return a range ID so that only full ranges can
> >>> later be marked non-volatile, or
> >>>
> >>> b) remember individual page purges so that sub-range changes can
> >>> properly report them
> >>>
> >>> I don't like a) much because it's somewhat arbitrarily more
> >>> restrictive than madvise, mprotect, mmap/munmap etc.
> >> Agreed on A.
> >>
> >>> And for b),
> >>> the straight-forward solution would be to put purge-cookies into
> >>> the page tables to properly report purges in subrange changes, but
> >>> that would be even more coordination between vmas, page tables, and
> >>> the ad-hoc vranges.
> >> And for B this would cause way too much overhead for the mark/unmark
> >> operations, which have to be lightweight.
> > Yes, and allocators/message passers truly don't need this because at
> > the time they set a region to volatile the contents are invalidated
> > and the non-volatile declaration doesn't give a hoot if content has
> > been destroyed.
> >
> > But caches certainly would have to know if they should regenerate the
> > contents. And bigger areas should be using huge pages, so we'd check
> > in 2MB steps. Is this really more expensive than regenerating the
> > contents on a false positive?
>
> So you make a good argument. I'd counter that the false-positives are
> only caused when unmarking subranges of larger marked volatile range,
> and for use cases that would care about regenerating the contents,
> that's not a likely useage model (as they're probably going to be
> marking objects in memory volatile/nonvolatile, not just arbitrary
> ranges of pages).

I can imagine that applications have continuous areas of same-sized
objects and want to mark a whole range of them volatile in one go,
then later come back for individual objects.

Otherwise we'd require N adjacent objects to be marked individually
through N syscalls to create N separate internal ranges, or they'd get
strange and unexpected results.

I'm agreeing with you about what's the most likely and common usecase,
but it shouldn't get too weird around the edges.

> > MADV_NONVOLATILE and MADV_NONVOLATILE_REPORT? (catchy, I know...)
>
> Something like this might be doable. I suspect the non-reporting
> non-volatile is more of a special case (reporting should probably be the
> default - as its safer), so it should probably have the longer name.

Sure thing.

> >> As for real-disk-backed file volatility, I'm not particularly interested
> >> in that, and fine with losing it. However, some have expressed
> >> theoretical interest that there may be cases where throwing the memory
> >> away is faster then writing it back to disk, so it might have some value
> >> there. But I don't have any concrete use cases that need it.
> > That could also be covered with an interface that clears dirty bits in
> > a range of pages.
>
> Well.. possibly. The issues with real files are ugly, since you don't
> want have stale data show up after the purge. In the past I proposed the
> hole punching for this, but that could be just as costly then writing
> the data back.
>
> Practically, I really don't see how true-file volatility makes any sense.
>
> The only rational interest was made by Dave Chinner, but what he really
> wanted was totally different. Something like a file-system persistent
> (instead of in-memory) volatility, so that filesystems could pick chunks
> of files to purge when disk space got tight.

Yeah, it might be a good idea to keep this separate and focus on
memory reclaim behavior for now.

It might even be beneficial if interfaces for memory volatility and
filesystem volatility are not easily mistaken for each other :)

> > The way I see it, the simplest design, the common denominator for
> > private anon, true file, shmem, tmpfs, would be for MADV/FADV_VOLATILE
> > to clear dirty bits off shared pages, or ptes/pmds in the private
> > mapped case to keep the COW charade intact. And for the NONVOLATILE
> > side to set dirty on what's still present and report if something is
> > missing.
>
> Hrmmm. This sounds reasonable, but I'm not sure its right. It seems the
> missing part here is (with anonymous scanning on swapless systems), we
> still have to set the volatility of the page somewhere so the LRU
> scanner will actually purge those made-clean pages, no? Otherwise won't
> anon and tmpfs files either just be swapped or passed over and left in
> memory? And again, for true-files, we don't want stale data.

During anon page reclaim, we walk all vmas that reference a page to
check its reference bits, then add it to swap (sets PageDirty), walk
all vmas again to unmap and install swap entries, then write it out
and reclaim it.

We should be able to modify the walk for reference bits to also check
for dirty bits, and then don't add a page without any to swap. It'll
be unmapped (AFAICS try_to_unmap_one needs minor tweak) and discarded
without swap.

Unless I'm missing something, clean shmem pages are discarded like any
other clean filesystem page.

> > Allocators and message passers don't care about content once volatile,
> > only about the memory. They wouldn't even have to go through the
> > non-volatile step anymore, they could just write to memory again and
> > it'll set the dirty bits and refault what's missing.
>
> So this would implicitly make any write effectively clear the volatility
> of a page? That probably would be an ok semantic with the use cases I'm
> aware of, but its new.

Yes.

> > Such an interface would be dead simple to use and consistent across
> > all types. The basic implementation would require only a couple of
> > lines of code, and while O(pages), it would still be much cheaper than
> > thrashing and swapping, and still cheaper than actively giving ranges
> > back to the kernel and reallocating and repopulating them later on.
> >
> > Compare this to the diffstat of the current vrange implementation and
> > the complexity and inconsistencies it introduces into the VM. I'm not
> > sure an O(pages) interface would be unattractive enough to justify it.
>
> Ok. So I think we're in agreement with:
> * Moving the volatility state for anonymous volatility into the VMA
> * Getting anonymous scanning going again on swapless systems

Cool!

> I'm still not totally sure about, but willing to try
> * Page granular volatile tracking

Okay.

> I'm still not convinced on:
> * madvise as a sufficient interface

Me neither :)

> I'll try to work out a draft of what your proposing (probably just for
> anonymous memory for now) and we can iterate from there?

Sounds good to me.

2014-01-31 16:49:20

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> It's interesting timing, I posted this patch Yew Year's Day
> and receives indepth design review Lunar New Year's Day. :)
> It's almost 0-day review. :)

That's the only way I can do 0-day reviews ;)

> On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> > Hello Minchan,
> >
> > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > > Hey all,
> > >
> > > Happy New Year!
> > >
> > > I know it's bad timing to send this unfamiliar large patchset for
> > > review but hope there are some guys with freshed-brain in new year
> > > all over the world. :)
> > > And most important thing is that before I dive into lots of testing,
> > > I'd like to make an agreement on design issues and others
> > >
> > > o Syscall interface
> >
> > Why do we need another syscall for this? Can't we extend madvise to
>
> Yeb. I should have written the reason. Early versions in this patchset
> had used madvise with VMA handling but it was terrible performance for
> ebizzy workload by mmap_sem's downside lock due to merging/split VMA.
> Even it was worse than old so I gave up the VMA approach.
>
> You could see the difference.
> https://lkml.org/lkml/2013/10/8/63

So the compared kernels are 4 releases apart and the test happened
inside a VM. It's also not really apparent from that link what the
tested workload is doing. We first have to agree that it's doing
nothing that could be avoided. E.g. we wouldn't introduce an
optimized version of write() because an application that writes 4G at
one byte per call is having problems.

The vroot lock has the same locking granularity as mmap_sem. Why is
mmap_sem more contended in this test?

> > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> > in the range was purged?
>
> In that case, -ENOMEM would have duplicated meaning "Purged" and "Out
> of memory so failed in the middle of the system call processing" and
> later could be a problem so we need to return value to indicate
> how many bytes are succeeded so far so it means we need additional
> out parameter. But yes, we can solve it by modifying semantic and
> behavior (ex, as you said below, we could just unmark volatile
> successfully if user pass (offset, len) consistent with marked volatile
> ranges. (IOW, if we give up overlapping/subrange marking/unmakring
> usecase. I expect it makes code simple further).
> It's request from John so If he is okay, I'm no problem.

Yes, I don't insist on using madvise. And it's too early to decide on
an interface before we haven't fully nailed the semantics and features.

> > > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> > > footprint.
> >
> > VMAs are there to track attributes of memory ranges. Duplicating
> > large parts of their functionality and co-maintaining both structures
> > on create, destroy, split, and merge means duplicate code and complex
> > interactions.
> >
> > 1. You need to define semantics and coordinate what happens when the
> > vma underlying a volatile range changes.
> >
> > Either you have to strictly co-maintain both range objects, or you
> > have weird behavior like volatily outliving a vma and then applying
> > to a separate vma created in its place.
> >
> > Userspace won't get this right, and even in the kernel this is
> > error prone and adds a lot to the complexity of vma management.
>
> Current semantic is following as,
> Vma handling logic in mm doesn't need to know vrange handling because
> vrange's internal logic always checks validity of the vma but
> one thing to do in vma logic is only clearing old volatile ranges
> on creating new vma.
> (Look at [PATCH v10 02/16] vrange: Clear volatility on new mmaps)
> Acutally I don't like the idea and suggested following as.
> https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working&id=821f58333b381fd88ee7f37fd9c472949756c74e
> But John didn't like it. I guess if VMA size is really matter,
> maybe we can embedded the flag into somewhere field of
> vma(ex, vm_file LSB?)

It's not entirely clear to me how the per-VMA variable can work like
that when vmas can merge and split by other means (mprotect e.g.)

> > 2. If page reclaim discards a page from the upper end of a a range,
> > you mark the whole range as purged. If the user later marks the
> > lower half of the range as non-volatile, the syscall will report
> > purged=1 even though all requested pages are still there.
>
> True, The assumption is that basically, user should have a range
> per object but we gives flexibility for user to handle subranges
> of a volatile range so it might report false positive as you said.
> In that case, please user can use mincore(2) for accuracy if he
> want so he has flexiblity but lose performance a bit.
> It's a tradeoff, IMO.

Look, we can't present a syscall that takes an exact range of bytes
and then return results that are not applicable to this range at all.

We can not make performance trade-offs that compromise the semantics
of the interface, and then recommend using an unrelated system call
that takes the same byte range but somehow gets it right.

> > The only way to make these semantics clean is either
> >
> > a) have vrange() return a range ID so that only full ranges can
> > later be marked non-volatile, or
>
> >
> > b) remember individual page purges so that sub-range changes can
> > properly report them
> >
> > I don't like a) much because it's somewhat arbitrarily more
> > restrictive than madvise, mprotect, mmap/munmap etc. And for b),
> > the straight-forward solution would be to put purge-cookies into
> > the page tables to properly report purges in subrange changes, but
> > that would be even more coordination between vmas, page tables, and
> > the ad-hoc vranges.
>
> Agree but I don't want to put a accuracy of defalut vrange syscall.
> Page table lookup needs mmap_sem and O(N) cost so I'm afraid it would
> make userland folks hesitant using this system call.

If userspace sees nothing but cost in this system call, nothing but a
voluntary donation for the common good of the system, then it does not
matter how cheap this is, nobody will use it. Why would they? Even
if it's a lightweight call, they still have to implement a mechanism
for regenerating content etc. It's still an investment to make, so
there has to be a personal benefit or it's flawed from the beginning.

So why do applications want to use it?

> > 3. Page reclaim usually happens on individual pages until an
> > allocation can be satisfied, but the shrinker purges entire ranges.
>
> Strictly speaking, not entire rangeS but entire A range.
> This recent patchset bails out if we discard as much as VM want.
>
> >
> > Should it really take out an entire 1G volatile range even though 4
> > pages would have been enough to satisfy an allocation? Sure, we
> > assume a range represents an single "object" and userspace would
> > have to regenerate the whole thing with only one page missing, but
> > there is still a massive difference in page frees, faults, and
> > allocations.
>
> That's why I wanted to introduce full and partial purging flag as system
> call argument.

I just wonder why we would anything but partial purging.

> > There needs to be a *really* good argument why VMAs are not enough for
> > this purpose. I would really like to see anon volatility implemented
>
> Strictly speaking, volatile ranges has two goals.
>
> 1. Avoid unncessary swapping or OOM if system has lots of volatile memory
> 2. Give advanced free rather than madvise(DONTNEED)

Aren't they the same goal? Giving applications a cheap way to
relinquish unused memory. If there is memory pressure, well, it was
unused anyway. If there isn't, the memory range can be reused without
another mmap() and page faults.

> First goal is very clear so I don't need to say again
> but it seems second goal isn't clear so that I try elaborate it a bit.
>
> Current allocators really hates frequent calling munmap which is big
> performance overhead if other threads are allocating new address
> space or are faulting existing address space so they have used
> madvise(DONTNEED) as optimization so at least, faulting threads
> works well in parallel. It's better but allocators couldn't use
> madvise(DONTNEED) frequently because it still prevent other thread's
> allocation of new address space for a long time(becuase the overhead
> of the system call is O(vma_size(vma)).

My suggestion of clearing dirty bits off of page table ranges would
require only read-side mmap_sem.

> Volatile ranges system call never don't need to hold write-side mmap_sem
> and the execution time is almost O(log(nr_range))) and if we follow your
> suggestion(ie, vrange returns ID), it's O(1). It's better.

I already wrote that to John: what if you have an array of objects and
want to mark them all volatile, but then come back for individual
objects in the array? If vrange() creates a single range and returns
an ID, you can't do this, unless you call vrange() for every single
object first.

O(1) is great, but we are duplicating VMA functionality, anon reclaim
functionality, have all these strange interactions, and a very
restricted interface.

We have to make trade-offs here and I don't want to have all this
complexity if there isn't a really solid reason for it.

> Another concern is that some of people want to handle range
> fine-granularity, maybe worst case, PAGE_SIZE, in that case
> so many VMA could be created if purging happens sparsely so it would
> be really memory concern.

That's also no problem if we implement it based on dirty page table
bits.

> > as a VMA attribute, and have regular reclaim decide based on rmap of
> > individual pages whether it needs to swap or purge. Something like
> > this:
> >
> > MADV_VOLATILE:
> > split vma if necessary
> > set VM_VOLATILE
> >
> > MADV_NONVOLATILE:
> > clear VM_VOLATILE
> > merge vma if possible
> > pte walk to check for pmd_purged()/pte_purged()
> > return any_purged
>
> It could make system call really slow so allocator people really
> would be reluctant to use it.

So what do they do instead? munmap() and refault the pages? Or sit
on a bunch of unused memory and get killed by the OOM killer? Or wait
on IO while their unused pages are swapped in and out?

2014-02-03 18:36:31

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On Mon, Feb 03, 2014 at 03:58:06PM +0100, Jan Kara wrote:
> On Fri 31-01-14 11:49:01, Johannes Weiner wrote:
> > On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> > > > The only way to make these semantics clean is either
> > > >
> > > > a) have vrange() return a range ID so that only full ranges can
> > > > later be marked non-volatile, or
> > >
> > > >
> > > > b) remember individual page purges so that sub-range changes can
> > > > properly report them
> > > >
> > > > I don't like a) much because it's somewhat arbitrarily more
> > > > restrictive than madvise, mprotect, mmap/munmap etc. And for b),
> > > > the straight-forward solution would be to put purge-cookies into
> > > > the page tables to properly report purges in subrange changes, but
> > > > that would be even more coordination between vmas, page tables, and
> > > > the ad-hoc vranges.
> > >
> > > Agree but I don't want to put a accuracy of defalut vrange syscall.
> > > Page table lookup needs mmap_sem and O(N) cost so I'm afraid it would
> > > make userland folks hesitant using this system call.
> >
> > If userspace sees nothing but cost in this system call, nothing but a
> > voluntary donation for the common good of the system, then it does not
> > matter how cheap this is, nobody will use it. Why would they? Even
> I think this is a flawed logic. If you take it to the extreme then why
> each application doesn't allocate all the available memory and never free
> it? Because users will kick such application in the ass as soon as they
> have a viable alternative. So there is certainly a relatively strong
> benefit in being a good citizen on the system. But it's a matter of a
> tradeoff - if being a good citizen costs you too much (in the extreme if it
> would make the application hardly usable because it is too slow), then you
> just give up or hack it around in some other way...

Oh, that is exactly what I was trying to point out. The argument was
basically that it has to be as cheap and lightweight as humanly
possible because applications participate voluntarily and they won't
donate memory back if it comes at a cost.

And as you said, this is flawed. There is an incentive to give back
memory other than altruistic tendencies, namely the looming kick in
the butt.

So I very much agree that there is a trade-off to be had, but I think
the cost of the proposed implementation is not justified.

If we agree that simply not returning memory is unacceptable anyway,
providing an interface that is drastically cheaper than the current
means of returning memory is already an improvement. Even if it's
still O(#pages). So I think the incentive to use it is there. We
should design it to fit into the existing VM and then optimize it,
rather than design for an (unnecessary) optimization.

2014-02-04 01:09:20

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v10 00/16] Volatile Ranges v10

On Fri, Jan 31, 2014 at 11:49:01AM -0500, Johannes Weiner wrote:
> On Wed, Jan 29, 2014 at 02:11:02PM +0900, Minchan Kim wrote:
> > It's interesting timing, I posted this patch Yew Year's Day
> > and receives indepth design review Lunar New Year's Day. :)
> > It's almost 0-day review. :)
>
> That's the only way I can do 0-day reviews ;)
>
> > On Tue, Jan 28, 2014 at 07:03:59PM -0500, Johannes Weiner wrote:
> > > Hello Minchan,
> > >
> > > On Thu, Jan 02, 2014 at 04:12:08PM +0900, Minchan Kim wrote:
> > > > Hey all,
> > > >
> > > > Happy New Year!
> > > >
> > > > I know it's bad timing to send this unfamiliar large patchset for
> > > > review but hope there are some guys with freshed-brain in new year
> > > > all over the world. :)
> > > > And most important thing is that before I dive into lots of testing,
> > > > I'd like to make an agreement on design issues and others
> > > >
> > > > o Syscall interface
> > >
> > > Why do we need another syscall for this? Can't we extend madvise to
> >
> > Yeb. I should have written the reason. Early versions in this patchset
> > had used madvise with VMA handling but it was terrible performance for
> > ebizzy workload by mmap_sem's downside lock due to merging/split VMA.
> > Even it was worse than old so I gave up the VMA approach.
> >
> > You could see the difference.
> > https://lkml.org/lkml/2013/10/8/63
>
> So the compared kernels are 4 releases apart and the test happened
> inside a VM. It's also not really apparent from that link what the
> tested workload is doing. We first have to agree that it's doing
> nothing that could be avoided. E.g. we wouldn't introduce an
> optimized version of write() because an application that writes 4G at
> one byte per call is having problems.

About ebizzy workload, the process allocates several chunks then,
threads start to alloc own chunk and *copy( the content from random
chunk which was one of preallocated chunk to own chunk.
It means lots of threads are page-faulting so mmap_sem write-side
lock is really critical point for performance.
(I don't know ebizzy is really good for real practice but at least,
several papers and benchmark suites have used it so we couldn't
ignore. And per-thread allocator are really popular these days)

With VMA approach, we need mmap_sem write-side lock twice to mark/unmark
VM_VOLATILE in vma->vm_flags so with my experiment, the performance was
terrible as I said on link.

I don't think the situation of current kernel would be better than old.
And virtulization is really important technique thesedays so we couldn't
ignore that although I tested it on VM for convenience. If you want,
I surely can test it on bare box.

>
> The vroot lock has the same locking granularity as mmap_sem. Why is
> mmap_sem more contended in this test?

It seems above explanation is enough.

>
> > > take MADV_VOLATILE, MADV_NONVOLATILE, and return -ENOMEM if something
> > > in the range was purged?
> >
> > In that case, -ENOMEM would have duplicated meaning "Purged" and "Out
> > of memory so failed in the middle of the system call processing" and
> > later could be a problem so we need to return value to indicate
> > how many bytes are succeeded so far so it means we need additional
> > out parameter. But yes, we can solve it by modifying semantic and
> > behavior (ex, as you said below, we could just unmark volatile
> > successfully if user pass (offset, len) consistent with marked volatile
> > ranges. (IOW, if we give up overlapping/subrange marking/unmakring
> > usecase. I expect it makes code simple further).
> > It's request from John so If he is okay, I'm no problem.
>
> Yes, I don't insist on using madvise. And it's too early to decide on
> an interface before we haven't fully nailed the semantics and features.
>
> > > > o Not bind with vma split/merge logic to prevent mmap_sem cost and
> > > > o Not bind with vma split/merge logic to avoid vm_area_struct memory
> > > > footprint.
> > >
> > > VMAs are there to track attributes of memory ranges. Duplicating
> > > large parts of their functionality and co-maintaining both structures
> > > on create, destroy, split, and merge means duplicate code and complex
> > > interactions.
> > >
> > > 1. You need to define semantics and coordinate what happens when the
> > > vma underlying a volatile range changes.
> > >
> > > Either you have to strictly co-maintain both range objects, or you
> > > have weird behavior like volatily outliving a vma and then applying
> > > to a separate vma created in its place.
> > >
> > > Userspace won't get this right, and even in the kernel this is
> > > error prone and adds a lot to the complexity of vma management.
> >
> > Current semantic is following as,
> > Vma handling logic in mm doesn't need to know vrange handling because
> > vrange's internal logic always checks validity of the vma but
> > one thing to do in vma logic is only clearing old volatile ranges
> > on creating new vma.
> > (Look at [PATCH v10 02/16] vrange: Clear volatility on new mmaps)
> > Acutally I don't like the idea and suggested following as.
> > https://git.kernel.org/cgit/linux/kernel/git/minchan/linux.git/commit/?h=vrange-working&id=821f58333b381fd88ee7f37fd9c472949756c74e
> > But John didn't like it. I guess if VMA size is really matter,
> > maybe we can embedded the flag into somewhere field of
> > vma(ex, vm_file LSB?)
>
> It's not entirely clear to me how the per-VMA variable can work like
> that when vmas can merge and split by other means (mprotect e.g.)

I don't get it. What's problem in mprotect?
If mprotect try to merge puerged VMA and none-purged VMA,
it couldn't be merged.
If mprotect splits on purged VMA, both VMAs should have purged
state.
You are concerning of false-postive report?

>
> > > 2. If page reclaim discards a page from the upper end of a a range,
> > > you mark the whole range as purged. If the user later marks the
> > > lower half of the range as non-volatile, the syscall will report
> > > purged=1 even though all requested pages are still there.
> >
> > True, The assumption is that basically, user should have a range
> > per object but we gives flexibility for user to handle subranges
> > of a volatile range so it might report false positive as you said.
> > In that case, please user can use mincore(2) for accuracy if he
> > want so he has flexiblity but lose performance a bit.
> > It's a tradeoff, IMO.
>
> Look, we can't present a syscall that takes an exact range of bytes
> and then return results that are not applicable to this range at all.
>
> We can not make performance trade-offs that compromise the semantics
> of the interface, and then recommend using an unrelated system call
> that takes the same byte range but somehow gets it right.

Fair enough.

>
> > > The only way to make these semantics clean is either
> > >
> > > a) have vrange() return a range ID so that only full ranges can
> > > later be marked non-volatile, or
> >
> > >
> > > b) remember individual page purges so that sub-range changes can
> > > properly report them
> > >
> > > I don't like a) much because it's somewhat arbitrarily more
> > > restrictive than madvise, mprotect, mmap/munmap etc. And for b),
> > > the straight-forward solution would be to put purge-cookies into
> > > the page tables to properly report purges in subrange changes, but
> > > that would be even more coordination between vmas, page tables, and
> > > the ad-hoc vranges.
> >
> > Agree but I don't want to put a accuracy of defalut vrange syscall.
> > Page table lookup needs mmap_sem and O(N) cost so I'm afraid it would
> > make userland folks hesitant using this system call.
>
> If userspace sees nothing but cost in this system call, nothing but a
> voluntary donation for the common good of the system, then it does not
> matter how cheap this is, nobody will use it. Why would they? Even
> if it's a lightweight call, they still have to implement a mechanism
> for regenerating content etc. It's still an investment to make, so
> there has to be a personal benefit or it's flawed from the beginning.
>
> So why do applications want to use it?

In case of general allocator, sometime, madvise(DONTNEED) is really harmful
due to page-fault+allocation+zeroing

>
> > > 3. Page reclaim usually happens on individual pages until an
> > > allocation can be satisfied, but the shrinker purges entire ranges.
> >
> > Strictly speaking, not entire rangeS but entire A range.
> > This recent patchset bails out if we discard as much as VM want.
> >
> > >
> > > Should it really take out an entire 1G volatile range even though 4
> > > pages would have been enough to satisfy an allocation? Sure, we
> > > assume a range represents an single "object" and userspace would
> > > have to regenerate the whole thing with only one page missing, but
> > > there is still a massive difference in page frees, faults, and
> > > allocations.
> >
> > That's why I wanted to introduce full and partial purging flag as system
> > call argument.
>
> I just wonder why we would anything but partial purging.

I thought volatile pages are a not first class citizens.
I'd like to evict volatile pages firstly insead of working set.
Yes, it's really difficult to find when we should evict them so
that's one of reason why I introduced 09/16, where I just used
DER_PRIOIRTY - 2 for detecting memory pressure but it could be
changed easily with more smart algorithm in future.
Otherwise, we could deactivate volatile pages in tail of inactive LRU
when system call is called but it adds more time with write mmap_sem
so I'm not sure.

>
> > > There needs to be a *really* good argument why VMAs are not enough for
> > > this purpose. I would really like to see anon volatility implemented
> >
> > Strictly speaking, volatile ranges has two goals.
> >
> > 1. Avoid unncessary swapping or OOM if system has lots of volatile memory
> > 2. Give advanced free rather than madvise(DONTNEED)
>
> Aren't they the same goal? Giving applications a cheap way to
> relinquish unused memory. If there is memory pressure, well, it was
> unused anyway. If there isn't, the memory range can be reused without
> another mmap() and page faults.

Goal is same but implemtation should be different.
In case of 2, we should really avoid write mmap_sem.

>
> > First goal is very clear so I don't need to say again
> > but it seems second goal isn't clear so that I try elaborate it a bit.
> >
> > Current allocators really hates frequent calling munmap which is big
> > performance overhead if other threads are allocating new address
> > space or are faulting existing address space so they have used
> > madvise(DONTNEED) as optimization so at least, faulting threads
> > works well in parallel. It's better but allocators couldn't use
> > madvise(DONTNEED) frequently because it still prevent other thread's
> > allocation of new address space for a long time(becuase the overhead
> > of the system call is O(vma_size(vma)).
>
> My suggestion of clearing dirty bits off of page table ranges would
> require only read-side mmap_sem.

So, you are suggesting the approach which doesn't mark/unmark VM_VOLATILE
in vma->vm_flags?

>
> > Volatile ranges system call never don't need to hold write-side mmap_sem
> > and the execution time is almost O(log(nr_range))) and if we follow your
> > suggestion(ie, vrange returns ID), it's O(1). It's better.
>
> I already wrote that to John: what if you have an array of objects and
> want to mark them all volatile, but then come back for individual
> objects in the array? If vrange() creates a single range and returns
> an ID, you can't do this, unless you call vrange() for every single
> object first.
>
> O(1) is great, but we are duplicating VMA functionality, anon reclaim
> functionality, have all these strange interactions, and a very
> restricted interface.

I agree with your concern and that's why I tried volatile ranges
with VMA approach in earier versions but it was terrible for allocators. :(

>
> We have to make trade-offs here and I don't want to have all this
> complexity if there isn't a really solid reason for it.
>
> > Another concern is that some of people want to handle range
> > fine-granularity, maybe worst case, PAGE_SIZE, in that case
> > so many VMA could be created if purging happens sparsely so it would
> > be really memory concern.
>
> That's also no problem if we implement it based on dirty page table
> bits.

Still, marking/unmarking VM_VOLATILE is really problem.

>
> > > as a VMA attribute, and have regular reclaim decide based on rmap of
> > > individual pages whether it needs to swap or purge. Something like
> > > this:
> > >
> > > MADV_VOLATILE:
> > > split vma if necessary
> > > set VM_VOLATILE
> > >
> > > MADV_NONVOLATILE:
> > > clear VM_VOLATILE
> > > merge vma if possible
> > > pte walk to check for pmd_purged()/pte_purged()
> > > return any_purged
> >
> > It could make system call really slow so allocator people really
> > would be reluctant to use it.
>
> So what do they do instead? munmap() and refault the pages? Or sit
> on a bunch of unused memory and get killed by the OOM killer? Or wait
> on IO while their unused pages are swapped in and out?

The more I discuss with you, the more I convince that we should
separate normal volatile ranges's usecase and allocators's one.
Allocator doesn't need to look back purged ranges so it would
be unnecesary unmarking VM_VOLATILE and even it doen't need to
mark VM_VOLATILE of vma. If so, it's really same semantic with
MADV_FREE. Although MADV_FREE is O(#pages), it wouldn't be big
overhead as you said because trend are huge pages and it doen't
need to require mmap_sem write-side lock.
As bonus point, if we take the usecase out from volatile semantics,
we're okay to put more overhead into vrange syscall so it turns
out VMA approach would be good and everyone is happy?

Still problem I am concerning that WHEN we should evict volatile
pages which are second class citizens?

Thanks for the comment, Hannes!

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim