2013-03-12 07:38:53

by Minchan Kim

Subject: [RFC v7 00/11] Support vrange for anonymous page

First of all, let's define the term.
From now on, I'd like to call it vrange (a.k.a. volatile range)
for anonymous pages. If you have a better name in mind, please suggest it.

This version is still *RFC* because it's just a quick prototype:
it doesn't support THP/HugeTLB/KSM and doesn't even build on !x86.
Before sorting out further issues, I'd like to post the current
direction and discuss it. Of course, I'd like to continue this
discussion at the coming LSF/MM.

In this version, I changed lots of things; especially, I removed the
VMA-based approach because it needs the write-side lock of mmap_sem,
which would hurt performance on multi-threaded big SMP systems, as
KOSAKI pointed out. The VMA-based approach also can't easily meet the
requirements of the new system call semantics John Stultz suggested
for consistent purged handling.
(http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)

I tested this patchset with a modified jemalloc allocator; the port
was led by Jason Evans (the jemalloc author), who was interested in
this feature and was happy to port his allocator to the new system call.
Super thanks, Jason!

The benchmark used for testing is ebizzy. It has been used for testing
allocator performance, so it suits my purpose. Again, thanks for
recommending the benchmark, Jason.
(http://people.freebsd.org/~kris/scaling/ebizzy.html)

The results are good on my machine (12 CPUs, 1.2GHz, 2G DRAM):

ebizzy -S 20

jemalloc-vanilla: 52389 records/sec
jemalloc-vrange: 203414 records/sec

ebizzy -S 20 with background memory pressure

jemalloc-vanilla: 40746 records/sec
jemalloc-vrange: 174910 records/sec

And the improvement is even bigger on a KVM virtual machine.

This patchset is based on v3.9-rc2

- What's the sys_vrange(addr, length, mode, behavior)?

It's a hint the user delivers to the kernel so the kernel can *discard*
pages in the range at any time. mode is one of VRANGE_VOLATILE and
VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is a memory pin operation, so the
kernel can no longer discard any pages, while VRANGE_VOLATILE is a
memory unpin operation, so the kernel can discard pages in the vrange
at any time. At the moment, behavior is one of VRANGE_FULL and
VRANGE_PARTIAL. VRANGE_FULL tells the kernel that once it decides to
discard pages in a vrange, it should discard all of the pages in the
selected victim vrange. VRANGE_PARTIAL tells the kernel that it may
discard only some of the pages in a vrange. VRANGE_PARTIAL handling
isn't implemented yet.
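
To make the semantics concrete, here is a minimal user-space sketch.
Note this is illustration only: the syscall number and flag values
below are hypothetical placeholders (the real ones come from the
syscall table and uapi headers in this series), and the convention
that the VRANGE_NOVOLATILE call returns non-zero when pages were
purged is my reading of remove_vrange in patch 02.

#include <unistd.h>
#include <sys/syscall.h>
#include <sys/mman.h>

#define __NR_vrange       314	/* hypothetical; take from the patched table */
#define VRANGE_VOLATILE     0	/* hypothetical */
#define VRANGE_NOVOLATILE   1	/* hypothetical */
#define VRANGE_FULL         0	/* hypothetical */

int main(void)
{
	size_t len = 16 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	buf[0] = 'x';	/* populate the pages */

	/* Unpin: from here the kernel may discard these pages anytime. */
	syscall(__NR_vrange, buf, len, VRANGE_VOLATILE, VRANGE_FULL);

	/* ... later, pin again before touching the data ... */
	if (syscall(__NR_vrange, buf, len, VRANGE_NOVOLATILE, VRANGE_FULL))
		buf[0] = 'x';	/* purged: regenerate the contents */

	return 0;
}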

- What happens if the user accesses a page (ie, virtual address)
discarded by the kernel?

The user can encounter SIGBUS.

- What should the user do to avoid SIGBUS?
He should call vrange(addr, length, VRANGE_NOVOLATILE, behavior)
before accessing a range on which vrange(addr, length,
VRANGE_VOLATILE, behavior) was called.

- What happens if the user accesses a page (ie, virtual address)
that wasn't discarded by the kernel?

The user sees the valid data that was there before the
vrange(., VRANGE_VOLATILE) call, without a page fault.

- What's different with madvise(DONTNEED)?

System call semantic

DONTNEED guarantees the user always sees zero-filled pages after he
calls madvise, while after vrange he may see the old data or encounter
SIGBUS.

Internal implementation

madvise(DONTNEED) has to zap all mapped pages in the range, so its
overhead increases linearly with the number of mapped pages. Worse,
if the user then writes to the zapped pages, each access costs a page
fault + page allocation + memset.

vrange just registers an address range instead of zapping all of the
ptes in the vma, so it doesn't touch the ptes at all.

- What's the benefit compared to DONTNEED?

1. The system call overhead is smaller because vrange just registers
a range in an interval tree instead of zapping all the pages in the
range, so it should be really cheap.

2. It has a chance to eliminate overhead (e.g., zapping ptes + page fault
+ page allocation + memset(PAGE_SIZE)) if memory pressure isn't
severe.

3. It has the potential to zap all ptes and free the pages if memory
pressure is severe, so discard-scanning overhead could be smaller - TODO

- What is this targeting?

Firstly, user-space allocators like ptmalloc and jemalloc, or the heap
management of virtual machines like Dalvik. It also comes in handy for
embedded systems which don't have a swap device and so can't reclaim
anonymous pages. By discarding instead of swapping out, it can be used
on non-swap systems.
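
As a rough illustration, an allocator could hook it into its chunk
management like the sketch below (same hypothetical syscall number and
flag values as in the earlier sketch; the chunk granularity and
regeneration policy are of course up to the allocator):

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_vrange       314	/* hypothetical */
#define VRANGE_VOLATILE     0	/* hypothetical */
#define VRANGE_NOVOLATILE   1	/* hypothetical */
#define VRANGE_FULL         0	/* hypothetical */

/* Chunk goes on the allocator's free list: instead of
 * madvise(MADV_DONTNEED), let the kernel reclaim it lazily. */
static void chunk_retire(void *chunk, size_t size)
{
	syscall(__NR_vrange, chunk, size, VRANGE_VOLATILE, VRANGE_FULL);
}

/* Chunk is about to be handed out again: pin it first. Freed memory
 * carries no caller-visible contents, so nothing needs regenerating
 * even if some pages were purged meanwhile. */
static void chunk_reuse(void *chunk, size_t size)
{
	syscall(__NR_vrange, chunk, size, VRANGE_NOVOLATILE, VRANGE_FULL);
}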

Changelog from v6 - There are many changes.
* Remove vma-based approach
* Change system call semantic
* Add more meaningful experiment

Changelog from v5 - There are many changes.

* Support CONFIG_VOLATILE_PAGE
* Working with THP/KSM
* Remove vma hacking logic in m[no]volatile system call
* Discard page without swap cache
* Kswapd discard volatile page so we can discard volatile pages
although we don't have swap.

Changelog from v4

* Add new system call mvolatile/mnovolatile
* Add sigbus when user try to access volatile range
* Rebased on v3.7
* Applied bug fix from John Stultz, Thanks!

Changelog from v3

* Removing madvise(addr, length, MADV_NOVOLATILE).
* add vmstat about the number of discarded volatile pages
* discard volatile pages without promotion in reclaim path

Minchan Kim (11):
vrange: enable generic interval tree
add vrange basic data structure and functions
add new system call vrange(2)
add proc/pid/vrange information
Add purge operation
send SIGBUS when user tries to access purged page
keep mm_struct in vrange in system call context
add LRU handling for victim vrange
Get rid of dependency that all pages are from a zone in shrink_page_list
Purging vrange pages without swap
add purged page information in vmstat

arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/proc/base.c | 1 +
fs/proc/internal.h | 6 +
fs/proc/task_mmu.c | 129 ++++++
include/asm-generic/pgtable.h | 11 +
include/linux/mm_types.h | 5 +
include/linux/rmap.h | 15 +-
include/linux/swap.h | 1 +
include/linux/vm_event_item.h | 4 +
include/linux/vrange.h | 59 +++
include/uapi/asm-generic/mman-common.h | 5 +
init/main.c | 2 +
kernel/fork.c | 3 +
lib/Makefile | 2 +-
mm/Makefile | 2 +-
mm/ksm.c | 2 +-
mm/memory.c | 24 +-
mm/rmap.c | 23 +-
mm/swapfile.c | 36 ++
mm/vmscan.c | 74 +++-
mm/vmstat.c | 4 +
mm/vrange.c | 754 +++++++++++++++++++++++++++++++++
23 files changed, 1143 insertions(+), 22 deletions(-)
create mode 100644 include/linux/vrange.h
create mode 100644 mm/vrange.c

--
1.8.1.1


2013-03-12 07:38:58

by Minchan Kim

Subject: [RFC v7 01/11] vrange: enable generic interval tree

The anon vrange patches will use the generic interval tree, so let's enable it.
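
For context, this gives clients the generic interval tree API that the
later vrange patches rely on. A minimal sketch of intended usage (not
part of this patch, but the functions shown are the ones
lib/interval_tree.c exports):

#include <linux/kernel.h>
#include <linux/rbtree.h>
#include <linux/interval_tree.h>

/* A client embeds interval_tree_node and keeps its own payload;
 * node.start/node.last describe an inclusive range. */
struct my_range {
	struct interval_tree_node node;
	int payload;
};

static struct rb_root my_root = RB_ROOT;

static void my_insert(struct my_range *r, unsigned long start,
		      unsigned long last)
{
	r->node.start = start;
	r->node.last = last;
	interval_tree_insert(&r->node, &my_root);
}

/* Visit every stored range overlapping [start, last]. */
static void my_for_each_overlap(unsigned long start, unsigned long last)
{
	struct interval_tree_node *n;

	for (n = interval_tree_iter_first(&my_root, start, last);
	     n; n = interval_tree_iter_next(n, start, last)) {
		struct my_range *r = container_of(n, struct my_range, node);
		(void)r;	/* ... use r->payload ... */
	}
}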

Cc: Michel Lespinasse <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
lib/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/Makefile b/lib/Makefile
index d7946ff..17986fb 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o
+ earlycpio.o interval_tree.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
--
1.8.1.1

2013-03-12 07:39:06

by Minchan Kim

Subject: [RFC v7 11/11] add purged page information in vmstat

This patch adds vmstat information about pages discarded from
vranges, so the admin can see how many volatile pages are discarded
by the VM, and how efficiently. It could be an indicator of whether
vrange is working well.

PG_VRANGE_SCAN: the number of pages scanned for discarding
PG_VRANGE_DISCARD: the number of pages discarded in kswapd's vrange LRU order
PGDISCARD_DIRECT: the number of pages discarded in process context
PGDISCARD_KSWAPD: the number of pages discarded in kswapd's page LRU order
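
From user space, checking the new counters is then as simple as
filtering /proc/vmstat, e.g. with a quick sketch like this (the
counter names match the strings added below, and of course only show
up on a kernel with this series applied):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "pgvrange_", 9) ||
		    !strncmp(line, "pgdiscard_", 10))
			fputs(line, stdout);	/* name + count */
	fclose(f);
	return 0;
}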

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vm_event_item.h | 4 ++++
mm/vmstat.c | 4 ++++
mm/vrange.c | 9 +++++++++
3 files changed, 17 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index bd6cf61..3d8ad18 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PG_VRANGE_SCAN,
+ PG_VRANGE_DISCARD,
+ PGDISCARD_DIRECT,
+ PGDISCARD_KSWAPD,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e1d8ed1..55806d2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -754,6 +754,10 @@ const char * const vmstat_text[] = {

"pgfault",
"pgmajfault",
+ "pgvrange_scan",
+ "pgvrange_discard",
+ "pgdiscard_direct",
+ "pgdiscard_kswapd",

TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
diff --git a/mm/vrange.c b/mm/vrange.c
index 2f56d36..c0c5d50 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -518,6 +518,10 @@ int discard_vpage(struct page *page)
if (page_freeze_refs(page, 1)) {
unlock_page(page);
dec_zone_page_state(page, NR_ISOLATED_ANON);
+ if (current_is_kswapd())
+ count_vm_event(PGDISCARD_KSWAPD);
+ else
+ count_vm_event(PGDISCARD_DIRECT);
return 1;
}
}
@@ -584,11 +588,15 @@ static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
{
pte_t *pte;
spinlock_t *ptl;
+ unsigned long start = addr;

pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE)
vrange_pte_entry(*pte, addr, PAGE_SIZE, walk);
pte_unmap_unlock(pte - 1, ptl);
+
+ count_vm_events(PG_VRANGE_SCAN, (end - start) / PAGE_SIZE);
+
cond_resched();
return 0;

@@ -741,5 +749,6 @@ unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
if (start_vrange)
put_victim_range(start_vrange);

+ count_vm_events(PG_VRANGE_DISCARD, nr_discarded);
return nr_discarded;
}
--
1.8.1.1

2013-03-12 07:39:04

by Minchan Kim

Subject: [RFC v7 07/11] keep mm_struct in vrange in system call context

We need the mm_struct when discarding vrange pages in kswapd context.
This is a preparation for that.

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vrange.h | 1 +
mm/vrange.c | 20 +++++++++++---------
2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 24ed4c1..5238a67 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -11,6 +11,7 @@ static DECLARE_RWSEM(vrange_fork_lock);
struct vrange {
struct interval_tree_node node;
bool purged;
+ struct mm_struct *mm;
};

#define vrange_entry(ptr) \
diff --git a/mm/vrange.c b/mm/vrange.c
index 89fcae4..f4c1d04 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -29,8 +29,9 @@ static inline void __set_vrange(struct vrange *range,
}

static void __add_range(struct vrange *range,
- struct rb_root *root)
+ struct rb_root *root, struct mm_struct *mm)
{
+ range->mm = mm;
interval_tree_insert(&range->node, root);
}

@@ -52,11 +53,12 @@ static void free_vrange(struct vrange *range)

static inline void range_resize(struct rb_root *root,
struct vrange *range,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ struct mm_struct *mm)
{
__remove_range(range, root);
__set_vrange(range, start, end);
- __add_range(range, root);
+ __add_range(range, root, mm);
}

int add_vrange(struct mm_struct *mm,
@@ -95,8 +97,7 @@ int add_vrange(struct mm_struct *mm,

__set_vrange(new_range, start, end);
new_range->purged = purged;
-
- __add_range(new_range, root);
+ __add_range(new_range, root, mm);
out:
vrange_unlock(mm);
return 0;
@@ -129,15 +130,16 @@ int remove_vrange(struct mm_struct *mm,
__remove_range(range, root);
free_vrange(range);
} else if (node->start >= start) {
- range_resize(root, range, end, node->last);
+ range_resize(root, range, end, node->last, mm);
} else if (node->last <= end) {
- range_resize(root, range, node->start, start);
+ range_resize(root, range, node->start, start, mm);
} else {
used_new = true;
__set_vrange(new_range, end, node->last);
new_range->purged = range->purged;
- range_resize(root, range, node->start, start);
- __add_range(new_range, root);
+ new_range->mm = mm;
+ range_resize(root, range, node->start, start, mm);
+ __add_range(new_range, root, mm);
break;
}

--
1.8.1.1

2013-03-12 07:39:01

by Minchan Kim

Subject: [RFC v7 08/11] add LRU handling for victim vrange

This patch adds an LRU data structure for selecting the victim vrange
when memory pressure happens.

Basically, the VM selects old vranges, but if the user accessed a
purged page recently, the vrange that includes the page is activated,
because a page fault means one of two things: either the user process
will be killed, or it recovers from the SIGBUS and continues its work.
For the latter case, we have to keep the vrange out of victim
selection.

I admit LRU might not be the best policy, but I can't think of a
better idea, so I wanted to keep it simple. I think user space could
handle this better with enough information, so I hope it can be
handled via the mempressure notifier. Otherwise, if you have a better
idea, welcome!

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/vrange.h | 4 ++++
mm/memory.c | 1 +
mm/vrange.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 5238a67..26db168 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -12,6 +12,7 @@ struct vrange {
struct interval_tree_node node;
bool purged;
struct mm_struct *mm;
+ struct list_head lru; /* protected by lru_lock */
};

#define vrange_entry(ptr) \
@@ -44,6 +45,9 @@ bool vrange_address(struct mm_struct *mm, unsigned long start,

extern bool is_purged_vrange(struct mm_struct *mm, unsigned long address);

+unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard);
+void lru_move_vrange_to_head(struct mm_struct *mm, unsigned long address);
+
#else

static inline void vrange_init(void) {};
diff --git a/mm/memory.c b/mm/memory.c
index cc369ab..3cb0633 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3671,6 +3671,7 @@ anon:

if (unlikely(pte_vrange(entry))) {
if (!is_purged_vrange(mm, address)) {
+ lru_move_vrange_to_head(mm, address);
/* zap pte */
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
diff --git a/mm/vrange.c b/mm/vrange.c
index f4c1d04..b9b1ffa 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -14,6 +14,9 @@
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>

+static LIST_HEAD(lru_vrange);
+static DEFINE_SPINLOCK(lru_lock);
+
static struct kmem_cache *vrange_cachep;

void __init vrange_init(void)
@@ -28,10 +31,50 @@ static inline void __set_vrange(struct vrange *range,
range->node.last = end_idx;
}

+void lru_add_vrange(struct vrange *vrange)
+{
+ spin_lock(&lru_lock);
+ WARN_ON(!list_empty(&vrange->lru));
+ list_add(&vrange->lru, &lru_vrange);
+ spin_unlock(&lru_lock);
+}
+
+void lru_remove_vrange(struct vrange *vrange)
+{
+ spin_lock(&lru_lock);
+ if (!list_empty(&vrange->lru))
+ list_del_init(&vrange->lru);
+ spin_unlock(&lru_lock);
+}
+
+void lru_move_vrange_to_head(struct mm_struct *mm, unsigned long address)
+{
+ struct rb_root *root = &mm->v_rb;
+ struct interval_tree_node *node;
+ struct vrange *vrange;
+
+ vrange_lock(mm);
+ node = interval_tree_iter_first(root, address, address + PAGE_SIZE - 1);
+ if (node) {
+ vrange = container_of(node, struct vrange, node);
+ spin_lock(&lru_lock);
+ /*
+ * Race happens with get_victim_vrange so in such case,
+ * we can't move but it can put the vrange into head
+ * after finishing purging work so no problem.
+ */
+ if (!list_empty(&vrange->lru))
+ list_move(&vrange->lru, &lru_vrange);
+ spin_unlock(&lru_lock);
+ }
+ vrange_unlock(mm);
+}
+
static void __add_range(struct vrange *range,
struct rb_root *root, struct mm_struct *mm)
{
range->mm = mm;
+ lru_add_vrange(range);
interval_tree_insert(&range->node, root);
}

@@ -43,11 +86,14 @@ static void __remove_range(struct vrange *range,

static struct vrange *alloc_vrange(void)
{
- return kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
+ struct vrange *vrange = kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
+ INIT_LIST_HEAD(&vrange->lru);
+ return vrange;
}

static void free_vrange(struct vrange *range)
{
+ lru_remove_vrange(range);
kmem_cache_free(vrange_cachep, range);
}

--
1.8.1.1

2013-03-12 07:39:44

by Minchan Kim

Subject: [RFC v7 10/11] Purging vrange pages without swap

One remaining problem in vrange is that the VM reclaims anonymous
pages only if there is a swap system. This patch adds a new hook in
kswapd, above the scanning of normal LRU pages.

This patch discards all pages of the vmas in a vrange without
considering VRANGE_[FULL|PARTIAL] mode; that will be considered
in future work.

I should confess that I didn't spend enough time investigating where
a good place for the hook is. It might even be better to add a new
kvranged thread, because there have been a few bugs in kswapd these
days and it is very sensitive to small changes, so adding new hooks
may introduce another subtle problem.

It could be better to move the vrange code into kswapd after it
settles down in kvranged. Otherwise, we could leave it as it is in
kvranged.

The other issue is the scanning cost of virtual addresses. We don't
have any rss information per VMA, so kswapd can scan all addresses
without any gain, which can burn CPU. I have a plan to account rss
per VMA, at least for anonymous vmas.

Any comments are welcome!

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 3 +
include/linux/vrange.h | 4 +-
mm/vmscan.c | 45 +++++++++-
mm/vrange.c | 239 +++++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 279 insertions(+), 12 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6432dfb..e822a30 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -83,6 +83,9 @@ enum ttu_flags {
};

#ifdef CONFIG_MMU
+unsigned long discard_vrange_page_list(struct zone *zone,
+ struct list_head *page_list);
+
unsigned long vma_address(struct page *page, struct vm_area_struct *vma);

static inline void get_anon_vma(struct anon_vma *anon_vma)
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 26db168..4bcec40 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -5,14 +5,12 @@
#include <linux/interval_tree.h>
#include <linux/mm.h>

-/* To protect race with forker */
-static DECLARE_RWSEM(vrange_fork_lock);
-
struct vrange {
struct interval_tree_node node;
bool purged;
struct mm_struct *mm;
struct list_head lru; /* protected by lru_lock */
+ atomic_t refcount;
};

#define vrange_entry(ptr) \
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e36ee51..2220ce7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -683,7 +683,7 @@ static enum page_references page_check_references(struct page *page,
/*
* shrink_page_list() returns the number of reclaimed pages
*/
-static unsigned long shrink_page_list(struct list_head *page_list,
+unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
struct scan_control *sc,
enum ttu_flags ttu_flags,
@@ -985,6 +985,35 @@ keep:
return nr_reclaimed;
}

+
+unsigned long discard_vrange_page_list(struct zone *zone,
+ struct list_head *page_list)
+{
+ unsigned long ret;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
+ .may_unmap = 1,
+ .may_swap = 1,
+ .may_discard = 1
+ };
+
+ unsigned long dummy1, dummy2;
+ struct page *page;
+
+ list_for_each_entry(page, page_list, lru) {
+ VM_BUG_ON(!PageAnon(page));
+ ClearPageActive(page);
+ }
+
+ /* page_list have pages from multiple zones */
+ ret = shrink_page_list(page_list, NULL, &sc,
+ TTU_UNMAP|TTU_IGNORE_ACCESS,
+ &dummy1, &dummy2, false);
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -ret);
+ return ret;
+}
+
unsigned long reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *page_list)
{
@@ -2781,6 +2810,16 @@ loop_again:
if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
!zone_balanced(zone, testorder,
balance_gap, end_zone)) {
+
+ unsigned int nr_discard;
+ if (testorder == 0) {
+ nr_discard = discard_vrange_pages(zone,
+ SWAP_CLUSTER_MAX);
+ sc.nr_reclaimed += nr_discard;
+ if (zone_balanced(zone, testorder, 0,
+ end_zone))
+ goto zone_balanced;
+ }
shrink_zone(zone, &sc);

reclaim_state->reclaimed_slab = 0;
@@ -2805,7 +2844,8 @@ loop_again:
continue;
}

- if (zone_balanced(zone, testorder, 0, end_zone))
+ if (zone_balanced(zone, testorder, 0, end_zone)) {
+zone_balanced:
/*
* If a zone reaches its high watermark,
* consider it to be no longer congested. It's
@@ -2814,6 +2854,7 @@ loop_again:
* speculatively avoid congestion waits
*/
zone_clear_flag(zone, ZONE_CONGESTED);
+ }
}

/*
diff --git a/mm/vrange.c b/mm/vrange.c
index b9b1ffa..2f56d36 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -13,15 +13,29 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>
+#include <linux/migrate.h>
+
+struct vrange_walker_private {
+ struct zone *zone;
+ struct vm_area_struct *vma;
+ struct list_head *pagelist;
+};

static LIST_HEAD(lru_vrange);
static DEFINE_SPINLOCK(lru_lock);

static struct kmem_cache *vrange_cachep;

+static void vrange_ctor(void *data)
+{
+ struct vrange *vrange = data;
+ INIT_LIST_HEAD(&vrange->lru);
+}
+
void __init vrange_init(void)
{
- vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
+ vrange_cachep = kmem_cache_create("vrange", sizeof(struct vrange),
+ 0, SLAB_PANIC, vrange_ctor);
}

static inline void __set_vrange(struct vrange *range,
@@ -78,6 +92,7 @@ static void __add_range(struct vrange *range,
interval_tree_insert(&range->node, root);
}

+/* remove range from interval tree */
static void __remove_range(struct vrange *range,
struct rb_root *root)
{
@@ -87,7 +102,8 @@ static void __remove_range(struct vrange *range,
static struct vrange *alloc_vrange(void)
{
struct vrange *vrange = kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
- INIT_LIST_HEAD(&vrange->lru);
+ if (vrange)
+ atomic_set(&vrange->refcount, 1);
return vrange;
}

@@ -97,6 +113,13 @@ static void free_vrange(struct vrange *range)
kmem_cache_free(vrange_cachep, range);
}

+static void put_vrange(struct vrange *range)
+{
+ WARN_ON(atomic_read(&range->refcount) < 0);
+ if (atomic_dec_and_test(&range->refcount))
+ free_vrange(range);
+}
+
static inline void range_resize(struct rb_root *root,
struct vrange *range,
unsigned long start, unsigned long end,
@@ -127,7 +150,7 @@ int add_vrange(struct mm_struct *mm,

range = container_of(node, struct vrange, node);
if (node->start < start && node->last > end) {
- free_vrange(new_range);
+ put_vrange(new_range);
goto out;
}

@@ -136,7 +159,7 @@ int add_vrange(struct mm_struct *mm,

purged |= range->purged;
__remove_range(range, root);
- free_vrange(range);
+ put_vrange(range);

node = next;
}
@@ -174,7 +197,7 @@ int remove_vrange(struct mm_struct *mm,

if (start <= node->start && end >= node->last) {
__remove_range(range, root);
- free_vrange(range);
+ put_vrange(range);
} else if (node->start >= start) {
range_resize(root, range, end, node->last, mm);
} else if (node->last <= end) {
@@ -194,7 +217,7 @@ int remove_vrange(struct mm_struct *mm,

vrange_unlock(mm);
if (!used_new)
- free_vrange(new_range);
+ put_vrange(new_range);

return ret;
}
@@ -209,7 +232,7 @@ void exit_vrange(struct mm_struct *mm)
range = vrange_entry(next);
next = rb_next(next);
__remove_range(range, &mm->v_rb);
- free_vrange(range);
+ put_vrange(range);
}
}

@@ -494,6 +517,7 @@ int discard_vpage(struct page *page)

if (page_freeze_refs(page, 1)) {
unlock_page(page);
+ dec_zone_page_state(page, NR_ISOLATED_ANON);
return 1;
}
}
@@ -518,3 +542,204 @@ bool is_purged_vrange(struct mm_struct *mm, unsigned long address)
vrange_unlock(mm);
return ret;
}
+
+static void vrange_pte_entry(pte_t pteval, unsigned long address,
+ unsigned ptent_size, struct mm_walk *walk)
+{
+ struct page *page;
+ struct vrange_walker_private *vwp = walk->private;
+ struct vm_area_struct *vma = vwp->vma;
+ struct list_head *pagelist = vwp->pagelist;
+ struct zone *zone = vwp->zone;
+
+ if (pte_none(pteval))
+ return;
+
+ if (!pte_present(pteval))
+ return;
+
+ page = vm_normal_page(vma, address, pteval);
+ if (unlikely(!page))
+ return;
+
+ if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
+ return;
+
+ /* TODO : Support THP and HugeTLB */
+ if (unlikely(PageCompound(page)))
+ return;
+
+ if (zone_idx(page_zone(page)) > zone_idx(zone))
+ return;
+
+ if (isolate_lru_page(page))
+ return;
+
+ list_add(&page->lru, pagelist);
+ inc_zone_page_state(page, NR_ISOLATED_ANON);
+}
+
+static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ for (; addr != end; pte++, addr += PAGE_SIZE)
+ vrange_pte_entry(*pte, addr, PAGE_SIZE, walk);
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+
+}
+
+unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, unsigned int nr_to_discard)
+{
+ LIST_HEAD(pagelist);
+ int ret = 0;
+ struct vrange_walker_private vwp;
+ struct mm_walk vrange_walk = {
+ .pmd_entry = vrange_pte_range,
+ .mm = vma->vm_mm,
+ .private = &vwp,
+ };
+
+ vwp.pagelist = &pagelist;
+ vwp.vma = vma;
+ vwp.zone = zone;
+
+ walk_page_range(start, end, &vrange_walk);
+
+ if (!list_empty(&pagelist))
+ ret = discard_vrange_page_list(zone, &pagelist);
+
+ putback_lru_pages(&pagelist);
+ return ret;
+}
+
+unsigned int discard_vrange(struct zone *zone, struct vrange *vrange,
+ int nr_to_discard)
+{
+ struct mm_struct *mm = vrange->mm;
+ unsigned long start = vrange->node.start;
+ unsigned long end = vrange->node.last;
+ struct vm_area_struct *vma;
+ unsigned int nr_discarded = 0;
+
+ if (!down_read_trylock(&mm->mmap_sem))
+ goto out;
+
+ vma = find_vma(mm, start);
+ if (!vma || (vma->vm_start > end))
+ goto out_unlock;
+
+ for (; vma; vma = vma->vm_next) {
+ if (vma->vm_start > end)
+ break;
+
+ if (vma->vm_file ||
+ (vma->vm_flags & (VM_SPECIAL | VM_LOCKED)))
+ continue;
+
+ cond_resched();
+ nr_discarded +=
+ discard_vma_pages(zone, mm, vma,
+ max_t(unsigned long, start, vma->vm_start),
+ min_t(unsigned long, end + 1, vma->vm_end),
+ nr_to_discard);
+ }
+out_unlock:
+ up_read(&mm->mmap_sem);
+out:
+ return nr_discarded;
+}
+
+/*
+ * Get next victim vrange from LRU and hold a vrange refcount
+ * and vrange->mm's refcount.
+ */
+struct vrange *get_victim_vrange(void)
+{
+ struct mm_struct *mm;
+ struct vrange *vrange = NULL;
+ struct list_head *cur, *tmp;
+
+ spin_lock(&lru_lock);
+ list_for_each_prev_safe(cur, tmp, &lru_vrange) {
+ vrange = list_entry(cur, struct vrange, lru);
+ mm = vrange->mm;
+ /* the process is exiting so pass it */
+ if (atomic_read(&mm->mm_users) == 0) {
+ list_del_init(&vrange->lru);
+ vrange = NULL;
+ continue;
+ }
+
+ /* vrange is freeing so continue to loop */
+ if (!atomic_inc_not_zero(&vrange->refcount)) {
+ list_del_init(&vrange->lru);
+ vrange = NULL;
+ continue;
+ }
+
+ /*
+ * we need to access mmap_sem further routine so
+ * need to get a refcount of mm.
+ * NOTE: We guarantee mm_count isn't zero in here because
+ * if we found vrange from LRU list, it means we are
+ * before exit_vrange or remove_vrange.
+ */
+ atomic_inc(&mm->mm_count);
+
+ /* Isolate vrange */
+ list_del_init(&vrange->lru);
+ break;
+ }
+
+ spin_unlock(&lru_lock);
+ return vrange;
+}
+
+void put_victim_range(struct vrange *vrange)
+{
+ put_vrange(vrange);
+ mmdrop(vrange->mm);
+}
+
+unsigned int discard_vrange_pages(struct zone *zone, int nr_to_discard)
+{
+ struct vrange *vrange, *start_vrange;
+ unsigned int nr_discarded = 0;
+
+ start_vrange = vrange = get_victim_vrange();
+ if (start_vrange) {
+ struct mm_struct *mm = start_vrange->mm;
+ atomic_inc(&start_vrange->refcount);
+ atomic_inc(&mm->mm_count);
+ }
+
+ while (vrange) {
+ nr_discarded += discard_vrange(zone, vrange, nr_to_discard);
+ lru_add_vrange(vrange);
+ put_victim_range(vrange);
+
+ if (nr_discarded >= nr_to_discard)
+ break;
+
+ vrange = get_victim_vrange();
+ /* break if we go round the loop */
+ if (vrange == start_vrange) {
+ lru_add_vrange(vrange);
+ put_victim_range(vrange);
+ break;
+ }
+ }
+
+ if (start_vrange)
+ put_victim_range(start_vrange);
+
+ return nr_discarded;
+}
--
1.8.1.1

2013-03-12 07:40:04

by Minchan Kim

Subject: [RFC v7 09/11] Get rid of dependency that all pages are from a zone in shrink_page_list

Currently shrink_page_list expects all pages to come from the same
zone, but that's too limiting.

This patch removes the dependency and adds may_discard to scan_control
so the next patch can use shrink_page_list with pages from multiple
zones.

Signed-off-by: Minchan Kim <[email protected]>
---
mm/vmscan.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6ba4e8ea..e36ee51 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,9 @@ struct scan_control {
/* Can pages be swapped as part of reclaim? */
int may_swap;

+ /* Discard pages in vrange */
+ int may_discard;
+
int order;

/* Scan (total_size >> priority) pages at once */
@@ -714,7 +717,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep;

VM_BUG_ON(PageActive(page));
- VM_BUG_ON(page_zone(page) != zone);
+ if (zone)
+ VM_BUG_ON(page_zone(page) != zone);

sc->nr_scanned++;

@@ -785,6 +789,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
; /* try to reclaim the page below */
}

+ /* Fail to discard a page and returns a page to caller */
+ if (sc->may_discard)
+ goto keep_locked;
+
/*
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
@@ -963,7 +971,8 @@ keep:
* back off and wait for congestion to clear because further reclaim
* will encounter the same problem
*/
- if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
+ if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc) &&
+ zone)
zone_set_flag(zone, ZONE_CONGESTED);

free_hot_cold_page_list(&free_pages, 1);
--
1.8.1.1

2013-03-12 07:40:38

by Minchan Kim

Subject: [RFC v7 06/11] send SIGBUS when user tries to access purged page

By vrange(2) semantics, the user should see SIGBUS if he tries to
access a purged page without calling vrange(...VRANGE_NOVOLATILE...)
first.

This patch implements it.

I reused the PSE bit for this quick prototype without much
consideration, so I need time to see which bit is really free, and I
am surely missing many places that should handle the vrange pte bit.
I should investigate all of the pte-handling places, especially the
pte_none cases. TODO
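
From user space, the observable behavior should be something like the
sketch below (hypothetical syscall number and flag values again, and
it's my assumption that the faulting address is delivered in si_addr):

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_vrange       314	/* hypothetical */
#define VRANGE_VOLATILE     0	/* hypothetical */
#define VRANGE_FULL         0	/* hypothetical */

static void on_sigbus(int sig, siginfo_t *si, void *uc)
{
	/* A real user would mark the range non-volatile and regenerate
	 * the data; this sketch just reports and exits. */
	fprintf(stderr, "SIGBUS at %p: page was purged\n", si->si_addr);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = on_sigbus,
		.sa_flags = SA_SIGINFO,
	};
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	sigaction(SIGBUS, &sa, NULL);
	p[0] = 'x';
	syscall(__NR_vrange, p, len, VRANGE_VOLATILE, VRANGE_FULL);
	/* If memory pressure purged the page before this read, we take
	 * the SIGBUS path above; otherwise we see the old data. */
	printf("%c\n", p[0]);
	return 0;
}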

Signed-off-by: Minchan Kim <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 2 ++
include/asm-generic/pgtable.h | 11 +++++++++++
include/linux/vrange.h | 2 ++
mm/memory.c | 23 +++++++++++++++++++++--
mm/vrange.c | 26 ++++++++++++++++++++++++--
5 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 567b5d0..8c5163f 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,8 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

+#define _PAGE_VRANGE _PAGE_BIT_PSE
+
/*
* _PAGE_NUMA indicates that this page will trigger a numa hinting
* minor page fault to gather numa placement statistics (see
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index bfd8768..1486d42 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -469,6 +469,17 @@ static inline unsigned long my_zero_pfn(unsigned long addr)

#ifdef CONFIG_MMU

+static inline pte_t pte_mkvrange(pte_t pte)
+{
+ pte = pte_set_flags(pte, _PAGE_VRANGE);
+ return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline int pte_vrange(pte_t pte)
+{
+ return ((pte_flags(pte) | _PAGE_PRESENT) == _PAGE_VRANGE);
+}
+
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
static inline int pmd_trans_huge(pmd_t pmd)
{
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index eb3f941..24ed4c1 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -41,6 +41,8 @@ int discard_vpage(struct page *page);
bool vrange_address(struct mm_struct *mm, unsigned long start,
unsigned long end);

+extern bool is_purged_vrange(struct mm_struct *mm, unsigned long address);
+
#else

static inline void vrange_init(void) {};
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..cc369ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/string.h>
+#include <linux/vrange.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -840,7 +841,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,

/* pte contains position in swap or file, so copy. */
if (unlikely(!pte_present(pte))) {
- if (!pte_file(pte)) {
+ if (!pte_file(pte) && !pte_vrange(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);

if (swap_duplicate(entry) < 0)
@@ -1180,7 +1181,7 @@ again:
if (pte_file(ptent)) {
if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
print_bad_pte(vma, addr, ptent, NULL);
- } else {
+ } else if (!pte_vrange(ptent)) {
swp_entry_t entry = pte_to_swp_entry(ptent);

if (!non_swap_entry(entry))
@@ -3663,9 +3664,27 @@ int handle_pte_fault(struct mm_struct *mm,
return do_linear_fault(mm, vma, address,
pte, pmd, flags, entry);
}
+anon:
return do_anonymous_page(mm, vma, address,
pte, pmd, flags);
}
+
+ if (unlikely(pte_vrange(entry))) {
+ if (!is_purged_vrange(mm, address)) {
+ /* zap pte */
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*pte, entry)))
+ goto unlock;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ ptep_clear_flush(vma, address, pte);
+ pte_unmap_unlock(pte, ptl);
+ goto anon;
+ }
+
+ return VM_FAULT_SIGBUS;
+ }
+
if (pte_file(entry))
return do_nonlinear_fault(mm, vma, address,
pte, pmd, flags, entry);
diff --git a/mm/vrange.c b/mm/vrange.c
index 78aa252..89fcae4 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -343,7 +343,9 @@ int try_to_discard_one(struct page *page, struct vm_area_struct *vma,

present = pte_present(*pte);
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
+
+ ptep_clear_flush(vma, address, pte);
+ pteval = pte_mkvrange(*pte);

update_hiwater_rss(mm);
dec_mm_counter(mm, MM_ANONPAGES);
@@ -357,10 +359,12 @@ int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
BUG_ON(1);
}

+ set_pte_at(mm, address, pte, pteval);
+ __vrange_purge(mm, address, address + PAGE_SIZE -1);
pte_unmap_unlock(pte, ptl);
mmu_notifier_invalidate_page(mm, address);
+ vrange_unlock(mm);
ret = 1;
- __vrange_purge(mm, address, address + PAGE_SIZE -1);
out:
return ret;
}
@@ -448,3 +452,21 @@ int discard_vpage(struct page *page)

return 0;
}
+
+bool is_purged_vrange(struct mm_struct *mm, unsigned long address)
+{
+ struct rb_root *root = &mm->v_rb;
+ struct interval_tree_node *node;
+ struct vrange *range;
+ bool ret = false;
+
+ vrange_lock(mm);
+ node = interval_tree_iter_first(root, address, address + PAGE_SIZE - 1);
+ if (node) {
+ range = container_of(node, struct vrange, node);
+ if (range->purged)
+ ret = true;
+ }
+ vrange_unlock(mm);
+ return ret;
+}
--
1.8.1.1

2013-03-12 07:40:52

by Minchan Kim

Subject: [RFC v7 05/11] Add purge operation

This patch adds the discarding part. The logic is as follows.

1. Memory pressure happens.
2. The VM starts to reclaim anonymous pages if the system has a swap device.
3. Check whether the page is in a volatile range.
4. If so, zap the page from the process's page table.
(By vrange(2) semantics, we should mark the pte with a special
entry so that a page fault happens when the address is accessed.
That will be introduced in a later patch.)
5. If the page is unmapped from all processes, discard it instead of
swapping it out.

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 12 ++-
include/linux/swap.h | 1 +
include/linux/vrange.h | 9 ++
mm/ksm.c | 2 +-
mm/rmap.c | 23 +++--
mm/swapfile.c | 36 ++++++++
mm/vmscan.c | 16 +++-
mm/vrange.c | 235 +++++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 320 insertions(+), 14 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6dacb93..6432dfb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -83,6 +83,8 @@ enum ttu_flags {
};

#ifdef CONFIG_MMU
+unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+
static inline void get_anon_vma(struct anon_vma *anon_vma)
{
atomic_inc(&anon_vma->refcount);
@@ -182,9 +184,11 @@ static inline void page_dup_rmap(struct page *page)
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *, int is_locked,
- struct mem_cgroup *memcg, unsigned long *vm_flags);
+ struct mem_cgroup *memcg, unsigned long *vm_flags,
+ int *is_vrange);
int page_referenced_one(struct page *, struct vm_area_struct *,
- unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
+ unsigned long address, unsigned int *mapcount, unsigned long *vm_flags,
+ int *is_vrange);

#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)

@@ -249,9 +253,11 @@ int rmap_walk(struct page *page, int (*rmap_one)(struct page *,

static inline int page_referenced(struct page *page, int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_vrange)
{
*vm_flags = 0;
+ *is_vrange = 0;
return 0;
}

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2818a12..bf56fb4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -379,6 +379,7 @@ extern int swap_duplicate(swp_entry_t);
extern int swapcache_prepare(swp_entry_t);
extern void swap_free(swp_entry_t);
extern void swapcache_free(swp_entry_t, struct page *page);
+extern int __free_swap_and_cache(swp_entry_t);
extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
extern unsigned int count_swap_pages(int, int);
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
index 74b5e37..eb3f941 100644
--- a/include/linux/vrange.h
+++ b/include/linux/vrange.h
@@ -5,6 +5,9 @@
#include <linux/interval_tree.h>
#include <linux/mm.h>

+/* To protect race with forker */
+static DECLARE_RWSEM(vrange_fork_lock);
+
struct vrange {
struct interval_tree_node node;
bool purged;
@@ -34,6 +37,9 @@ static inline void vrange_unlock(struct mm_struct *mm)

extern void exit_vrange(struct mm_struct *mm);
void vrange_init(void);
+int discard_vpage(struct page *page);
+bool vrange_address(struct mm_struct *mm, unsigned long start,
+ unsigned long end);

#else

@@ -41,5 +47,8 @@ static inline void vrange_init(void) {};
static inline void mm_init_vrange(struct mm_struct *mm) {};
static inline void exit_vrange(struct mm_struct *mm);

+static inline bool vrange_address(struct mm_struct *mm, unsigned long start,
+ unsigned long end) { return false; };
+static inline int discard_vpage(struct page *page) { return 0 };
#endif
#endif /* _LINIUX_VRANGE_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index b6afe0c..debc20c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1932,7 +1932,7 @@ again:
continue;

referenced += page_referenced_one(page, vma,
- rmap_item->address, &mapcount, vm_flags);
+ rmap_item->address, &mapcount, vm_flags, NULL);
if (!search_new_forks || !mapcount)
break;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 807c96b..90cf51c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -57,6 +57,8 @@
#include <linux/migrate.h>
#include <linux/hugetlb.h>
#include <linux/backing-dev.h>
+#include <linux/vrange.h>
+#include <linux/rmap.h>

#include <asm/tlbflush.h>

@@ -523,8 +525,7 @@ __vma_address(struct page *page, struct vm_area_struct *vma)
return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}

-inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
+unsigned long vma_address(struct page *page, struct vm_area_struct *vma)
{
unsigned long address = __vma_address(page, vma);

@@ -662,7 +663,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
*/
int page_referenced_one(struct page *page, struct vm_area_struct *vma,
unsigned long address, unsigned int *mapcount,
- unsigned long *vm_flags)
+ unsigned long *vm_flags, int *is_vrange)
{
struct mm_struct *mm = vma->vm_mm;
int referenced = 0;
@@ -724,6 +725,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
referenced++;
}
pte_unmap_unlock(pte, ptl);
+ if (is_vrange &&
+ vrange_address(mm, address, address + PAGE_SIZE -1))
+ *is_vrange = 1;
}

(*mapcount)--;
@@ -736,7 +740,8 @@ out:

static int page_referenced_anon(struct page *page,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_vrange)
{
unsigned int mapcount;
struct anon_vma *anon_vma;
@@ -761,7 +766,7 @@ static int page_referenced_anon(struct page *page,
if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
referenced += page_referenced_one(page, vma, address,
- &mapcount, vm_flags);
+ &mapcount, vm_flags, is_vrange);
if (!mapcount)
break;
}
@@ -826,7 +831,7 @@ static int page_referenced_file(struct page *page,
if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
referenced += page_referenced_one(page, vma, address,
- &mapcount, vm_flags);
+ &mapcount, vm_flags, NULL);
if (!mapcount)
break;
}
@@ -841,6 +846,7 @@ static int page_referenced_file(struct page *page,
* @is_locked: caller holds lock on the page
* @memcg: target memory cgroup
* @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @is_vrange: the page in vrange of some process
*
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
@@ -848,7 +854,8 @@ static int page_referenced_file(struct page *page,
int page_referenced(struct page *page,
int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_vrange)
{
int referenced = 0;
int we_locked = 0;
@@ -867,7 +874,7 @@ int page_referenced(struct page *page,
vm_flags);
else if (PageAnon(page))
referenced += page_referenced_anon(page, memcg,
- vm_flags);
+ vm_flags, is_vrange);
else if (page->mapping)
referenced += page_referenced_file(page, memcg,
vm_flags);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a1f7772..962024c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -734,6 +734,42 @@ int try_to_free_swap(struct page *page)
}

/*
+ * It's almost same with free_swap_and_cache except page is already
+ * locked.
+ */
+int __free_swap_and_cache(swp_entry_t entry)
+{
+ struct swap_info_struct *p;
+ struct page *page = NULL;
+
+ if (non_swap_entry(entry))
+ return 1;
+
+ p = swap_info_get(entry);
+ if (p) {
+ if (swap_entry_free(p, entry, 1) == SWAP_HAS_CACHE) {
+ page = find_get_page(swap_address_space(entry),
+ entry.val);
+ }
+ spin_unlock(&swap_lock);
+ }
+
+ if (page) {
+ /*
+ * Not mapped elsewhere, or swap space full? Free it!
+ * Also recheck PageSwapCache now page is locked (above).
+ */
+ if (PageSwapCache(page) && !PageWriteback(page) &&
+ (!page_mapped(page) || vm_swap_full())) {
+ delete_from_swap_cache(page);
+ SetPageDirty(page);
+ }
+ page_cache_release(page);
+ }
+ return p != NULL;
+}
+
+/*
* Free the swap entry like above, but also try to
* free the page cache entry if it is the last user.
*/
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 88c5fed..6ba4e8ea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -42,6 +42,7 @@
#include <linux/sysctl.h>
#include <linux/oom.h>
#include <linux/prefetch.h>
+#include <linux/vrange.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -610,6 +611,7 @@ enum page_references {
PAGEREF_RECLAIM,
PAGEREF_RECLAIM_CLEAN,
PAGEREF_KEEP,
+ PAGEREF_DISCARD,
PAGEREF_ACTIVATE,
};

@@ -618,9 +620,10 @@ static enum page_references page_check_references(struct page *page,
{
int referenced_ptes, referenced_page;
unsigned long vm_flags;
+ int is_vrange = 0;

referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
- &vm_flags);
+ &vm_flags, &is_vrange);
referenced_page = TestClearPageReferenced(page);

/*
@@ -630,6 +633,12 @@ static enum page_references page_check_references(struct page *page,
if (vm_flags & VM_LOCKED)
return PAGEREF_RECLAIM;

+ /*
+ * Bail out if the page is in vrange and try to discard.
+ */
+ if (is_vrange)
+ return PAGEREF_DISCARD;
+
if (referenced_ptes) {
if (PageSwapBacked(page))
return PAGEREF_ACTIVATE;
@@ -768,6 +777,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto activate_locked;
case PAGEREF_KEEP:
goto keep_locked;
+ case PAGEREF_DISCARD:
+ if (discard_vpage(page))
+ goto free_it;
case PAGEREF_RECLAIM:
case PAGEREF_RECLAIM_CLEAN:
; /* try to reclaim the page below */
@@ -1496,7 +1508,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
}

if (page_referenced(page, 0, sc->target_mem_cgroup,
- &vm_flags)) {
+ &vm_flags, NULL)) {
nr_rotated += hpage_nr_pages(page);
/*
* Identify referenced, file-backed active pages and
diff --git a/mm/vrange.c b/mm/vrange.c
index f8b6f0e..78aa252 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -6,6 +6,13 @@
#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/mman.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include "internal.h"
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>

static struct kmem_cache *vrange_cachep;

@@ -213,3 +220,231 @@ SYSCALL_DEFINE4(vrange, unsigned long, start,
out:
return ret;
}
+
+bool __vrange_address(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct rb_root *root = &mm->v_rb;
+ struct interval_tree_node *node;
+
+ node = interval_tree_iter_first(root, start, end);
+ return node ? true : false;
+}
+
+bool vrange_address(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ bool ret;
+
+ vrange_lock(mm);
+ ret = __vrange_address(mm, start, end);
+ vrange_unlock(mm);
+ return ret;
+}
+
+static pte_t *__vpage_check_address(struct page *page,
+ struct mm_struct *mm, unsigned long address, spinlock_t **ptlp)
+{
+ pmd_t *pmd;
+ pte_t *pte;
+ spinlock_t *ptl;
+ bool present;
+
+ /* TODO : look into tlbfs */
+ if (unlikely(PageHuge(page)))
+ return NULL;
+
+ pmd = mm_find_pmd(mm, address);
+ if (!pmd)
+ return NULL;
+ /*
+ * TODO : Support THP
+ */
+ if (pmd_trans_huge(*pmd))
+ return NULL;
+
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (pte_none(*pte))
+ goto out;
+
+ present = pte_present(*pte);
+ if (present && page_to_pfn(page) != pte_pfn(*pte))
+ goto out;
+ else if (present) {
+ *ptlp = ptl;
+ return pte;
+ } else {
+ swp_entry_t entry = { .val = page_private(page) };
+
+ VM_BUG_ON(non_swap_entry(entry));
+ if (entry.val != pte_to_swp_entry(*pte).val)
+ goto out;
+ *ptlp = ptl;
+ return pte;
+ }
+out:
+ pte_unmap_unlock(pte, ptl);
+ return NULL;
+}
+
+/*
+ * This functions checks @page is matched with pte's encoded one
+ * which could be a page or swap slot.
+ */
+static inline pte_t *vpage_check_address(struct page *page,
+ struct mm_struct *mm, unsigned long address,
+ spinlock_t **ptlp)
+{
+ pte_t *ptep;
+ __cond_lock(*ptlp, ptep = __vpage_check_address(page,
+ mm, address, ptlp));
+ return ptep;
+}
+
+static void __vrange_purge(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct rb_root *root = &mm->v_rb;
+ struct vrange *range;
+ struct interval_tree_node *node;
+
+ node = interval_tree_iter_first(root, start, end);
+ while (node) {
+ range = container_of(node, struct vrange, node);
+ range->purged = true;
+ node = interval_tree_iter_next(node, start, end);
+ }
+}
+
+int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ pte_t pteval;
+ spinlock_t *ptl;
+ int ret = 0;
+ bool present;
+
+ VM_BUG_ON(!PageLocked(page));
+
+ vrange_lock(mm);
+ pte = vpage_check_address(page, mm, address, &ptl);
+ if (!pte) {
+ vrange_unlock(mm);
+ goto out;
+ }
+
+ if (vma->vm_flags & VM_LOCKED) {
+ pte_unmap_unlock(pte, ptl);
+ vrange_unlock(mm);
+ return 0;
+ }
+
+ present = pte_present(*pte);
+ flush_cache_page(vma, address, page_to_pfn(page));
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ update_hiwater_rss(mm);
+ dec_mm_counter(mm, MM_ANONPAGES);
+
+ page_remove_rmap(page);
+ page_cache_release(page);
+ if (!present) {
+ swp_entry_t entry = pte_to_swp_entry(*pte);
+ dec_mm_counter(mm, MM_SWAPENTS);
+ if (unlikely(!__free_swap_and_cache(entry)))
+ BUG_ON(1);
+ }
+
+ pte_unmap_unlock(pte, ptl);
+ mmu_notifier_invalidate_page(mm, address);
+ ret = 1;
+ __vrange_purge(mm, address, address + PAGE_SIZE -1);
+out:
+ return ret;
+}
+
+static int try_to_discard_vpage(struct page *page)
+{
+ struct anon_vma *anon_vma;
+ struct anon_vma_chain *avc;
+ pgoff_t pgoff;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ unsigned long address;
+ bool ret = 0;
+
+ anon_vma = page_lock_anon_vma_read(page);
+ if (!anon_vma)
+ return ret;
+
+ pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ vma = avc->vma;
+ mm = vma->vm_mm;
+ address = vma_address(page, vma);
+
+ vrange_lock(mm);
+ /*
+ * We can't use page_check_address because it doesn't check
+ * swap entry of the page table. We need the check because
+ * we have to make sure atomicity of shared vrange.
+ * It means all vranges which are shared a page should be
+ * purged if a page in a process is purged.
+ */
+ pte = vpage_check_address(page, mm, address, &ptl);
+ if (!pte) {
+ vrange_unlock(mm);
+ continue;
+ }
+
+ if (vma->vm_flags & VM_LOCKED) {
+ pte_unmap_unlock(pte, ptl);
+ vrange_unlock(mm);
+ goto out;
+ }
+
+ pte_unmap_unlock(pte, ptl);
+ if (!__vrange_address(mm, address,
+ address + PAGE_SIZE - 1)) {
+ vrange_unlock(mm);
+ goto out;
+ }
+
+ vrange_unlock(mm);
+ }
+
+ anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+ vma = avc->vma;
+ address = vma_address(page, vma);
+ if (!try_to_discard_one(page, vma, address))
+ goto out;
+ }
+
+ ret = 1;
+out:
+ page_unlock_anon_vma_read(anon_vma);
+ return ret;
+}
+
+int discard_vpage(struct page *page)
+{
+ VM_BUG_ON(!PageLocked(page));
+ VM_BUG_ON(PageLRU(page));
+
+ if (try_to_discard_vpage(page)) {
+ if (PageSwapCache(page))
+ try_to_free_swap(page);
+
+ if (page_freeze_refs(page, 1)) {
+ unlock_page(page);
+ return 1;
+ }
+ }
+
+ return 0;
+}
--
1.8.1.1

2013-03-12 07:38:55

by Minchan Kim

Subject: [RFC v7 02/11] add vrange basic data structure and functions

This patch adds the vrange data structure (an interval tree) and
related functions.

vrange uses the generic interval tree as its main data structure:
since it handles address ranges, the generic interval tree fits the
purpose well.

add_vrange/remove_vrange are the core functions for the system call
that will be introduced in the next patch.

1. add_vrange inserts a new address range into the interval tree.
If the new address range overlaps an existing volatile range,
the existing volatile range is expanded to cover the new range.
Then, if the existing volatile range had purged state, the new
range inherits the purged state.
That's not ideal, and we need more fine-grained purged-state
handling within a vrange (TODO).

If the new address range is inside an existing range, we ignore it.

2. remove_vrange removes an address range and then returns the
purged state of the removed ranges (see the sketch below).

This patch copied some parts from John Stultz's work, but with different semantics.
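
To illustrate the resulting semantics with a sketch, using a
hypothetical wrapper around the syscall (same placeholder syscall
number and flag values as earlier in the thread; page indexes refer
to one anonymous mapping):

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_vrange       314	/* hypothetical */
#define VRANGE_VOLATILE     0	/* hypothetical */
#define VRANGE_NOVOLATILE   1	/* hypothetical */
#define VRANGE_FULL         0	/* hypothetical */

static long vrange(void *addr, size_t len, int mode, int behavior)
{
	return syscall(__NR_vrange, addr, len, mode, behavior);
}

static void demo(char *base, size_t page)
{
	/* Overlapping volatile ranges are merged into one covering
	 * range, which inherits the purged state of either part: */
	vrange(base,            3 * page, VRANGE_VOLATILE, VRANGE_FULL);
	vrange(base + 2 * page, 3 * page, VRANGE_VOLATILE, VRANGE_FULL);
	/* the tree now holds a single range covering pages 0..4 */

	/* A range fully inside an existing one is ignored: */
	vrange(base + page, page, VRANGE_VOLATILE, VRANGE_FULL);

	/* Removing the middle splits the range in two and returns the
	 * purged state of what was removed: */
	vrange(base + 2 * page, page, VRANGE_NOVOLATILE, VRANGE_FULL);
	/* the tree now holds pages 0..1 and pages 3..4 */
}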

Signed-off-by: John Stultz <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/mm_types.h | 5 ++
include/linux/vrange.h | 45 ++++++++++++++
init/main.c | 2 +
kernel/fork.c | 3 +
mm/Makefile | 2 +-
mm/vrange.c | 157 +++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 213 insertions(+), 1 deletion(-)
create mode 100644 include/linux/vrange.h
create mode 100644 mm/vrange.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..080bf74 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
+#include <linux/mutex.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -351,6 +352,10 @@ struct mm_struct {
*/


+#ifdef CONFIG_MMU
+ struct rb_root v_rb; /* vrange rb tree */
+ struct mutex v_lock; /* Protect v_rb */
+#endif
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */

diff --git a/include/linux/vrange.h b/include/linux/vrange.h
new file mode 100644
index 0000000..74b5e37
--- /dev/null
+++ b/include/linux/vrange.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_VRANGE_H
+#define _LINUX_VRANGE_H
+
+#include <linux/mutex.h>
+#include <linux/interval_tree.h>
+#include <linux/mm.h>
+
+struct vrange {
+ struct interval_tree_node node;
+ bool purged;
+};
+
+#define vrange_entry(ptr) \
+ container_of(ptr, struct vrange, node.rb)
+
+#ifdef CONFIG_MMU
+struct mm_struct;
+
+static inline void mm_init_vrange(struct mm_struct *mm)
+{
+ mm->v_rb = RB_ROOT;
+ mutex_init(&mm->v_lock);
+}
+
+static inline void vrange_lock(struct mm_struct *mm)
+{
+ mutex_lock(&mm->v_lock);
+}
+
+static inline void vrange_unlock(struct mm_struct *mm)
+{
+ mutex_unlock(&mm->v_lock);
+}
+
+extern void exit_vrange(struct mm_struct *mm);
+void vrange_init(void);
+
+#else
+
+static inline void vrange_init(void) {};
+static inline void mm_init_vrange(struct mm_struct *mm) {};
+static inline void exit_vrange(struct mm_struct *mm);
+
+#endif
+#endif /* _LINIUX_VRANGE_H */
diff --git a/init/main.c b/init/main.c
index 63534a1..0b9e0b5 100644
--- a/init/main.c
+++ b/init/main.c
@@ -72,6 +72,7 @@
#include <linux/ptrace.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
+#include <linux/vrange.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -605,6 +606,7 @@ asmlinkage void __init start_kernel(void)
calibrate_delay();
pidmap_init();
anon_vma_init();
+ vrange_init();
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
efi_enter_virtual_mode();
diff --git a/kernel/fork.c b/kernel/fork.c
index 8d932b1..e3aa120 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
#include <linux/khugepaged.h>
#include <linux/signalfd.h>
#include <linux/uprobes.h>
+#include <linux/vrange.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -541,6 +542,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
spin_lock_init(&mm->page_table_lock);
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
+ mm_init_vrange(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);

@@ -612,6 +614,7 @@ void mmput(struct mm_struct *mm)

if (atomic_dec_and_test(&mm->mm_users)) {
uprobe_clear_state(mm);
+ exit_vrange(mm);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..a31235e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
mmu-y := nommu.o
mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
- vmalloc.o pagewalk.o pgtable-generic.o
+ vmalloc.o pagewalk.o pgtable-generic.o vrange.o

ifdef CONFIG_CROSS_MEMORY_ATTACH
mmu-$(CONFIG_MMU) += process_vm_access.o
diff --git a/mm/vrange.c b/mm/vrange.c
new file mode 100644
index 0000000..e265c82
--- /dev/null
+++ b/mm/vrange.c
@@ -0,0 +1,157 @@
+/*
+ * mm/vrange.c
+ */
+
+#include <linux/vrange.h>
+#include <linux/slab.h>
+
+static struct kmem_cache *vrange_cachep;
+
+void __init vrange_init(void)
+{
+ vrange_cachep = KMEM_CACHE(vrange, SLAB_PANIC);
+}
+
+static inline void __set_vrange(struct vrange *range,
+ unsigned long start_idx, unsigned long end_idx)
+{
+ range->node.start = start_idx;
+ range->node.last = end_idx;
+}
+
+static void __add_range(struct vrange *range,
+ struct rb_root *root)
+{
+ interval_tree_insert(&range->node, root);
+}
+
+static void __remove_range(struct vrange *range,
+ struct rb_root *root)
+{
+ interval_tree_remove(&range->node, root);
+}
+
+static struct vrange *alloc_vrange(void)
+{
+ return kmem_cache_alloc(vrange_cachep, GFP_KERNEL);
+}
+
+static void free_vrange(struct vrange *range)
+{
+ kmem_cache_free(vrange_cachep, range);
+}
+
+static inline void range_resize(struct rb_root *root,
+ struct vrange *range,
+ unsigned long start, unsigned long end)
+{
+ __remove_range(range, root);
+ __set_vrange(range, start, end);
+ __add_range(range, root);
+}
+
+int add_vrange(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct rb_root *root;
+ struct vrange *new_range, *range;
+ struct interval_tree_node *node, *next;
+ int purged = 0;
+
+ new_range = alloc_vrange();
+ if (!new_range)
+ return -ENOMEM;
+
+ root = &mm->v_rb;
+ vrange_lock(mm);
+ node = interval_tree_iter_first(root, start, end);
+ while (node) {
+ next = interval_tree_iter_next(node, start, end);
+
+ range = container_of(node, struct vrange, node);
+ if (node->start < start && node->last > end) {
+ free_vrange(new_range);
+ goto out;
+ }
+
+ start = min_t(unsigned long, start, node->start);
+ end = max_t(unsigned long, end, node->last);
+
+ purged |= range->purged;
+ __remove_range(range, root);
+ free_vrange(range);
+
+ node = next;
+ }
+
+ __set_vrange(new_range, start, end);
+ new_range->purged = purged;
+
+ __add_range(new_range, root);
+out:
+ vrange_unlock(mm);
+ return 0;
+}
+
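+/*
+ * Clear the volatile state of [start, end] (end inclusive). Ranges
+ * fully inside the span are removed, ranges overlapping one side are
+ * trimmed, and a range covering the whole span is split in two using
+ * the preallocated new_range. Returns 1 if any overlapping range had
+ * been purged, 0 if none were, or -ENOMEM on allocation failure.
+ */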
+int remove_vrange(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ struct rb_root *root;
+ struct vrange *new_range, *range;
+ struct interval_tree_node *node, *next;
+ int ret = 0;
+ bool used_new = false;
+
+ new_range = alloc_vrange();
+ if (!new_range)
+ return -ENOMEM;
+
+ root = &mm->v_rb;
+ vrange_lock(mm);
+
+ node = interval_tree_iter_first(root, start, end);
+ while (node) {
+ next = interval_tree_iter_next(node, start, end);
+
+ range = container_of(node, struct vrange, node);
+ ret |= range->purged;
+
+ if (start <= node->start && end >= node->last) {
+ __remove_range(range, root);
+ free_vrange(range);
+ } else if (node->start >= start) {
+ range_resize(root, range, end, node->last);
+ } else if (node->last <= end) {
+ range_resize(root, range, node->start, start);
+ } else {
+ used_new = true;
+ __set_vrange(new_range, end, node->last);
+ new_range->purged = range->purged;
+ range_resize(root, range, node->start, start);
+ __add_range(new_range, root);
+ break;
+ }
+
+ node = next;
+ }
+
+ vrange_unlock(mm);
+ if (!used_new)
+ free_vrange(new_range);
+
+ return ret;
+}
+
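+/*
+ * Tear down every remaining vrange of an address space. Called from
+ * mmput() once the last user of the mm is gone, so the tree is no
+ * longer reachable by anyone else.
+ */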
+void exit_vrange(struct mm_struct *mm)
+{
+ struct vrange *range;
+ struct rb_node *next;
+
+ next = rb_first(&mm->v_rb);
+ while (next) {
+ range = vrange_entry(next);
+ next = rb_next(next);
+ __remove_range(range, &mm->v_rb);
+ free_vrange(range);
+ }
+}
--
1.8.1.1

2013-03-12 07:41:16

by Minchan Kim

[permalink] [raw]
Subject: [RFC v7 04/11] add proc/pid/vrange information

Add per-process vrange information.
It will help with debugging.
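
For illustration, with the "%08lx-%08lx %c" format used below, reading
the new file might look like this (addresses are hypothetical):

  $ cat /proc/1234/vrange
  00400000-004fffff v
  00a00000-00afffff p

where 'v' marks a range that is still volatile and intact, and 'p' one
whose pages have been purged.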

Signed-off-by: Minchan Kim <[email protected]>
---
fs/proc/base.c | 1 +
fs/proc/internal.h | 6 +++
fs/proc/task_mmu.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/vrange.c | 2 +-
4 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 69078c7..c1a8506 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2523,6 +2523,7 @@ static const struct pid_entry tgid_base_stuff[] = {
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
REG("maps", S_IRUGO, proc_pid_maps_operations),
+ REG("vrange", S_IRUGO, proc_pid_vrange_operations),
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 85ff3a4..0584035 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -60,6 +60,7 @@ extern loff_t mem_lseek(struct file *file, loff_t offset, int orig);

extern const struct file_operations proc_tid_children_operations;
extern const struct file_operations proc_pid_maps_operations;
+extern const struct file_operations proc_pid_vrange_operations;
extern const struct file_operations proc_tid_maps_operations;
extern const struct file_operations proc_pid_numa_maps_operations;
extern const struct file_operations proc_tid_numa_maps_operations;
@@ -82,6 +83,11 @@ struct proc_maps_private {
#endif
};

+struct proc_vrange_private {
+ struct pid *pid;
+ struct task_struct *task;
+};
+
void proc_init_inodecache(void);

static inline struct pid *proc_pid(struct inode *inode)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3e636d8..df009f0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -11,6 +11,7 @@
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/vrange.h>

#include <asm/elf.h>
#include <asm/uaccess.h>
@@ -370,6 +371,134 @@ static int show_tid_map(struct seq_file *m, void *v)
return show_map(m, v, 0);
}

+static void *v_start(struct seq_file *m, loff_t *pos)
+{
+ struct vrange *range;
+ struct mm_struct *mm;
+ struct rb_root *root;
+ struct rb_node *next;
+ struct proc_vrange_private *priv = m->private;
+ loff_t n = *pos;
+
+ /* Clear the per syscall fields in priv */
+ priv->task = NULL;
+
+ priv->task = get_pid_task(priv->pid, PIDTYPE_PID);
+ if (!priv->task)
+ return ERR_PTR(-ESRCH);
+
+ mm = mm_access(priv->task, PTRACE_MODE_READ);
+ if (!mm || IS_ERR(mm))
+ return mm;
+
+ vrange_lock(mm);
+ root = &mm->v_rb;
+
+ if (RB_EMPTY_ROOT(&mm->v_rb))
+ goto out;
+
+ next = rb_first(&mm->v_rb);
+ range = vrange_entry(next);
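+ /* Skip the first *pos ranges so a later read resumes where it left off. */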
+ while (n > 0 && range) {
+ n--;
+ next = rb_next(next);
+ if (next)
+ range = vrange_entry(next);
+ else
+ range = NULL;
+ }
+ if (!n)
+ return range;
+out:
+ return NULL;
+}
+
+static void *v_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ struct vrange *range = v;
+ struct rb_node *next;
+
+ (*pos)++;
+ next = rb_next(&range->node.rb);
+ if (next) {
+ range = vrange_entry(next);
+ return range;
+ }
+ return NULL;
+}
+
+static void v_stop(struct seq_file *m, void *v)
+{
+ struct proc_vrange_private *priv = m->private;
+ if (priv->task) {
+ struct mm_struct *mm = priv->task->mm;
+ vrange_unlock(mm);
+ mmput(mm);
+ put_task_struct(priv->task);
+ }
+}
+
+static int show_vrange(struct seq_file *m, void *v, int is_pid)
+{
+
+ unsigned long start, end;
+ bool purged;
+ struct vrange *range = v;
+
+ start = range->node.start;
+ end = range->node.last;
+ purged = range->purged;
+
+ seq_printf(m, "%08lx-%08lx %c\n",
+ start,
+ end,
+ purged ? 'p' : 'v');
+ return 0;
+}
+
+static int show_vrange_map(struct seq_file *m, void *v)
+{
+ return show_vrange(m, v, 1);
+}
+
+static const struct seq_operations proc_pid_vrange_op = {
+ .start = v_start,
+ .next = v_next,
+ .stop = v_stop,
+ .show = show_vrange_map
+};
+
+static int do_vrange_open(struct inode *inode, struct file *file,
+ const struct seq_operations *ops)
+{
+ struct proc_vrange_private *priv;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (priv) {
+ priv->pid = proc_pid(inode);
+ ret = seq_open(file, ops);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = priv;
+ } else {
+ kfree(priv);
+ }
+ }
+ return ret;
+}
+
+static int pid_vrange_open(struct inode *inode, struct file *file)
+{
+ return do_vrange_open(inode, file, &proc_pid_vrange_op);
+}
+
+const struct file_operations proc_pid_vrange_operations = {
+ .open = pid_vrange_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release_private,
+};
+
static const struct seq_operations proc_pid_maps_op = {
.start = m_start,
.next = m_next,
diff --git a/mm/vrange.c b/mm/vrange.c
index 2f77d89..f8b6f0e 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -199,7 +199,7 @@ SYSCALL_DEFINE4(vrange, unsigned long, start,
if (!len)
goto out;

- end = start len;
+ end = start + len;
if (end < start)
goto out;

--
1.8.1.1

2013-03-12 07:41:36

by Minchan Kim

[permalink] [raw]
Subject: [RFC v7 03/11] add new system call vrange(2)

This patch adds new system call sys_vrange.

NAME
vrange - give a pin/unpin hint to the kernel to help reclaim.

SYNOPSIS
int vrange(unsigned long start, size_t length, int mode, int behavior);

DESCRIPTION
Applications can use vrange(2) to advise the kernel how it should
handle paging I/O in this VM area. The idea is to help the kernel
discard pages of a vrange instead of reclaiming them when memory
pressure happens. It means the kernel doesn't discard any pages of a
vrange if there is no memory pressure.

mode:

VRANGE_VOLATILE
hint to the kernel that the VM may discard pages in the
vrange when memory pressure happens.
VRANGE_NOVOLATILE
hint to the kernel that the VM must not discard pages in
the vrange any more.

behavior:

VRANGE_FULL_MODE
Once the VM starts to discard pages, it discards all pages
in a vrange.
VRANGE_PARTIAL_MODE
The VM discards some pages of all vranges in round-robin
order.

If the user tries to access purged memory without a prior
VRANGE_NOVOLATILE call, they can encounter SIGBUS if the page was
discarded by the kernel.

RETURN VALUE
On success vrange returns zero or 1. Zero means the kernel did not
discard any pages in [start, start + length). 1 means the kernel
discarded at least one page in the range.

ERRORS
EINVAL This error can occur for the following reasons:

* The value of length is negative.
* start is not page-aligned.
* mode or behavior is not a valid value.

ENOMEM Not enough memory.
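
EXAMPLE
A minimal sketch of a caller, assuming the x86-64 syscall number 314
from the table below and the VRANGE_* values this patch defines; a raw
syscall(2) invocation is used since no libc wrapper exists yet:

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#define __NR_vrange 314 /* from syscall_64.tbl below */
#define VRANGE_VOLATILE 0
#define VRANGE_NOVOLATILE 1
#define VRANGE_FULL_MODE 0

int main(void)
{
	size_t len = 16 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 'x', len);	/* populate the pages */

	/* Unpin: the kernel may now discard these pages under pressure. */
	syscall(__NR_vrange, (unsigned long)buf, len,
		VRANGE_VOLATILE, VRANGE_FULL_MODE);

	/* Pin again before reuse; a return of 1 means pages were purged. */
	long purged = syscall(__NR_vrange, (unsigned long)buf, len,
			      VRANGE_NOVOLATILE, VRANGE_FULL_MODE);
	if (purged == 1)
		memset(buf, 'x', len);	/* regenerate the discarded data */

	printf("purged: %ld\n", purged);
	return 0;
}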

Signed-off-by: Minchan Kim <[email protected]>
---
arch/x86/syscalls/syscall_64.tbl | 1 +
include/uapi/asm-generic/mman-common.h | 5 +++
mm/vrange.c | 58 ++++++++++++++++++++++++++++++++++
3 files changed, 64 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 38ae65d..dc332bd 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -320,6 +320,7 @@
311 64 process_vm_writev sys_process_vm_writev
312 common kcmp sys_kcmp
313 common finit_module sys_finit_module
+314 common vrange sys_vrange

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 4164529..736696e 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -66,4 +66,9 @@
#define MAP_HUGE_SHIFT 26
#define MAP_HUGE_MASK 0x3f

+#define VRANGE_VOLATILE 0 /* unpin all pages so VM can discard them */
+#define VRANGE_NOVOLATILE 1 /* pin all pages so VM can't discard them */
+
+#define VRANGE_FULL_MODE 0 /* discard all pages of the range */
+#define VRANGE_PARTIAL_MODE 1 /* discard a few pages of the range */
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/vrange.c b/mm/vrange.c
index e265c82..2f77d89 100644
--- a/mm/vrange.c
+++ b/mm/vrange.c
@@ -4,6 +4,8 @@

#include <linux/vrange.h>
#include <linux/slab.h>
+#include <linux/syscalls.h>
+#include <linux/mman.h>

static struct kmem_cache *vrange_cachep;

@@ -155,3 +157,59 @@ void exit_vrange(struct mm_struct *mm)
free_vrange(range);
}
}
+
+/*
+ * The vrange(2) system call.
+ *
+ * Applications can use vrange() to advise the kernel how it should
+ * handle paging I/O in this VM area. The idea is to help the kernel
+ * discard pages of a vrange instead of swapping them out when memory
+ * pressure happens. The information provided is advisory only, and
+ * can be safely disregarded by the kernel if the system has enough
+ * free memory.
+ *
+ * mode values:
+ * VRANGE_VOLATILE - hint to the kernel that the VM may discard vrange
+ * pages when memory pressure happens.
+ * VRANGE_NOVOLATILE - hint to the kernel that the VM must not discard
+ * vrange pages any more.
+ *
+ * behavior values:
+ * VRANGE_FULL_MODE - once the VM starts to discard pages, it discards
+ * all pages in a vrange.
+ * VRANGE_PARTIAL_MODE - the VM discards some pages of all vranges in
+ * round-robin order.
+ *
+ * return values:
+ * 0 - success and NOT purged.
+ * 1 - at least one of the pages in [start, start + len) was discarded
+ * by the VM.
+ * -EINVAL - len is zero after page alignment, start is not page-aligned,
+ * start + len wraps, start is greater than TASK_SIZE or "mode" is not
+ * a valid value.
+ * -ENOMEM - not enough free memory to complete the system call.
+ */
+SYSCALL_DEFINE4(vrange, unsigned long, start,
+ size_t, len, int, mode, int, behavior)
+{
+ unsigned long end;
+ struct mm_struct *mm = current->mm;
+ int ret = -EINVAL;
+
+ if (start & ~PAGE_MASK)
+ goto out;
+
+ len &= PAGE_MASK;
+ if (!len)
+ goto out;
+
+ end = start len;
+ if (end < start)
+ goto out;
+
+ if (start >= TASK_SIZE)
+ goto out;
+
+ if (mode == VRANGE_VOLATILE)
+ ret = add_vrange(mm, start, end - 1);
+ else if (mode == VRANGE_NOVOLATILE)
+ ret = remove_vrange(mm, start, end - 1);
+out:
+ return ret;
+}
--
1.8.1.1

2013-03-12 23:17:35

by Paul Turner

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Tue, Mar 12, 2013 at 12:38 AM, Minchan Kim <[email protected]> wrote:
> [...]
>
> - What's the benefit compared to DONTNEED?
>
> 1. The system call overhead is smaller because vrange just registers
> a range using an interval tree instead of zapping all the pages in a
> range, so the overhead should be really cheap.
>
> 2. It has a chance to eliminate overheads (ex, zapping ptes + page fault
> + page allocation + memset(PAGE_SIZE)) if memory pressure isn't
> severe.
>
> 3. It has the potential to zap all ptes and free the pages if memory
> pressure is severe, so the discard scanning overhead could be smaller - TODO
>
> - What's it targeting?
>
> Firstly, user-space allocators like ptmalloc and jemalloc, or the heap
> management of virtual machines like Dalvik. Also, it comes in handy for
> embedded systems which don't have a swap device, so they can't reclaim
> anonymous pages. By discarding instead of swapping out, it could be
> used on non-swap systems.

I think that another potentially useful use-case would be using this
-- or a similar API -- to opportunistically return deep user stack
frames.

This is another place where we strongly care about the time-to-free as
well as the time-to-reallocate in the case of relatively immediate
re-use.

>
> Changelog from v6 - There are many changes.
> * Remove vma-based approach
> * Change system call semantic
> * Add more meaningful experiment
>
> Changelog from v5 - There are many changes.
>
> * Support CONFIG_VOLATILE_PAGE
> * Working with THP/KSM
> * Remove vma hacking logic in m[no]volatile system call
> * Discard page without swap cache
> * Kswapd discard volatile page so we can discard volatile pages
> although we don't have swap.
>
> Changelog from v4
>
> * Add new system call mvolatile/mnovolatile
> * Add sigbus when user try to access volatile range
> * Rebased on v3.7
> * Applied bug fix from John Stultz, Thanks!
>
> Changelog from v3
>
> * Removing madvise(addr, length, MADV_NOVOLATILE).
> * add vmstat about the number of discarded volatile pages
> * discard volatile pages without promotion in reclaim path
>
> Minchan Kim (11):
> vrange: enable generic interval tree
> add vrange basic data structure and functions
> add new system call vrange(2)
> add proc/pid/vrange information
> Add purge operation
> send SIGBUS when user try to access purged page
> keep mm_struct to vrange when system call context
> add LRU handling for victim vrange
> Get rid of depenceny that all pages is from a zone in shrink_page_list
> Purging vrange pages without swap
> add purged page information in vmstat
>
> arch/x86/include/asm/pgtable_types.h | 2 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/proc/base.c | 1 +
> fs/proc/internal.h | 6 +
> fs/proc/task_mmu.c | 129 ++++++
> include/asm-generic/pgtable.h | 11 +
> include/linux/mm_types.h | 5 +
> include/linux/rmap.h | 15 +-
> include/linux/swap.h | 1 +
> include/linux/vm_event_item.h | 4 +
> include/linux/vrange.h | 59 +++
> include/uapi/asm-generic/mman-common.h | 5 +
> init/main.c | 2 +
> kernel/fork.c | 3 +
> lib/Makefile | 2 +-
> mm/Makefile | 2 +-
> mm/ksm.c | 2 +-
> mm/memory.c | 24 +-
> mm/rmap.c | 23 +-
> mm/swapfile.c | 36 ++
> mm/vmscan.c | 74 +++-
> mm/vmstat.c | 4 +
> mm/vrange.c | 754 +++++++++++++++++++++++++++++++++
> 23 files changed, 1143 insertions(+), 22 deletions(-)
> create mode 100644 include/linux/vrange.h
> create mode 100644 mm/vrange.c
>
> --
> 1.8.1.1
>

2013-03-13 06:45:07

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Tue, Mar 12, 2013 at 04:16:57PM -0700, Paul Turner wrote:
> On Tue, Mar 12, 2013 at 12:38 AM, Minchan Kim <[email protected]> wrote:
> > [...]
>
> I think that another potentially useful use-case would be using this
> -- or a similar API -- to opportunistically return deep user stack
> frames.
>
> This is another place where we strongly care about the time-to-free as
> well as the time-to-reallocate in the case of relatively immediate
> re-use.

Indeed. Great idea!
Thanks, Paul.

--
Kind regards,
Minchan Kim

2013-03-21 01:29:46

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 03/12/2013 12:38 AM, Minchan Kim wrote:
> [...]
>
> - What's the sys_vrange(addr, length, mode, behavior)?
>
> It's a hint that user deliver to kernel so kernel can *discard*
> pages in a range anytime. mode is one of VRANGE_VOLATILE and
> VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is memory pin operation so
> kernel coudn't discard any pages any more while VRANGE_VOLATILE
> is memory unpin opeartion so kernel can discard pages in vrange
> anytime. At a moment, behavior is one of VRANGE_FULL and VRANGE
> PARTIAL. VRANGE_FULL tell kernel that once kernel decide to
> discard page in a vrange, please, discard all of pages in a
> vrange selected by victim vrange. VRANGE_PARTIAL tell kernel
> that please discard of some pages in a vrange. But now I didn't
> implemented VRANGE_PARTIAL handling yet.


So I'm very excited to see this new revision! Moving away from the VMA
based approach I think is really necessary, since managing the volatile
ranges on a per-mm basis really isn't going to work when we want shared
volatile ranges between processes (such as the shmem/tmpfs case Android
uses).

Just a few questions and observations from my initial playing around
with the patch:

1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
would VRANGE_PARTIAL be useful?

2) I've got a trivial test program that I've used previously with ashmem
& my earlier file based efforts that allocates 26megs of page aligned
memory, and marks every other meg as volatile. Then it forks and the
child generates a ton of memory pressure, causing pages to be purged
(and the child killed by the OOM killer). Initially I didn't see my test
purging any pages with your patches. The problem of course was the
child's COW pages were not also marked volatile, so they could not be
purged. Once I over-wrote the data in the child, breaking the COW links,
the data in the parent was purged under pressure. This is good, because
it makes sure we don't purge cow pages if the volatility state isn't
consistent, but it also brings up a few questions:

- Should volatility be inherited on fork? If volatility is not
inherited on fork(), that could cause some strange behavior if the data
was purged prior to the fork, and also it's not clear what the behavior
of the child should be with regard to data that was volatile at fork
time. However, we also don't want strange behavior on exec if
overwritten volatile pages were unexpectedly purged.

- At this moment, maybe not having thought it through enough, I'm
wondering if it makes sense to have volatility inherited on fork, but
cleared on exec? What are your thoughts here? It's been a while, so I'm
not sure if that's consistent with my earlier comments on the topic.


3) Oddly, in my test case, once I changed the child to over-write the
volatile range and break the COW pages, the OOM killer more frequently
seems to favor killing the parent process instead of the memory-hogging
child process. I need to spend some more time looking at this, and I
know the OOM killer may go for the parent process sometimes, but it
definitely happens more frequently than when the COW pages are not
broken and no data is purged. Again, I need to dig in more here.


4) One of the harder aspects I'm trying to get my head around is how
your patches seem to use both the page list shrinkers (discard_vpage) to
purge ranges when particular pages are selected, and a zone shrinker
(discard_vrange_pages) which manages its own lru of vranges. I get that
this is one way to handle purging anonymous pages when we are on a
swapless system, but the dual purging systems definitely make the code
harder to follow. Would something like my earlier attempts at changing
vmscan to shrink anonymous pages be simpler? Or is that just not going
to fly w/ the mm folks?


I'll continue working with the patches and try to get tmpfs support
added here soon.

Also, attached is a simple cleanup patch that you might want to fold in.

thanks
-john


Attachments:
0001-vrange-Make-various-vrange.c-local-functions-static.patch (2.84 kB)

2013-03-22 06:01:18

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Wed, Mar 20, 2013 at 06:29:38PM -0700, John Stultz wrote:
> On 03/12/2013 12:38 AM, Minchan Kim wrote:
> > [...]
>
>
> So I'm very excited to see this new revision! Moving away from the
> VMA based approach I think is really necessary, since managing the
> volatile ranges on a per-mm basis really isn't going to work when we
> want shared volatile ranges between processes (such as the
> shmem/tmpfs case Android uses).
>
> Just a few questions and observations from my initial playing around
> with the patch:
>
> 1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
> would VRANGE_PARTIAL be useful?

For example, some process makes a 64M vrange and now the kernel needs
8M of pages to get out of memory pressure. In this case, we don't need
to discard all 64M at once, because if we discard only 8M of pages, the
allocator's cost is (8M/4K) * (page fault + allocation + zero-clearing)
rather than (64M/4K) * (page fault + allocation + zero-clearing)
otherwise.
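Concretely, with 4K pages that is 8M/4K = 2048 (fault + allocation +
zero-clearing) cycles to repopulate, versus 64M/4K = 16384 for the
whole range: an 8x difference.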

If it were a temporary image extracted from some compressed format,
it's not easy to regenerate the punched-hole data from the original
source, so it would be better to discard all of the pages in the
vrange, which will be far away from the memory reclaimer.

>
> 2) I've got a trivial test program that I've used previously with
> ashmem & my earlier file based efforts that allocates 26megs of page
> aligned memory, and marks every other meg as volatile. Then it forks
> and the child generates a ton of memory pressure, causing pages to
> be purged (and the child killed by the OOM killer). Initially I
> didn't see my test purging any pages with your patches. The problem
> of course was the child's COW pages were not also marked volatile,
> so they could not be purged. Once I over-wrote the data in the
> child, breaking the COW links, the data in the parent was purged
> under pressure. This is good, because it makes sure we don't purge
> cow pages if the volatility state isn't consistent, but it also
> brings up a few questions:
>
> - Should volatility be inherited on fork? If volatility is not
> inherited on fork(), that could cause some strange behavior if the
> data was purged prior to the fork, and also its not clear what the
> behavior of the child should be with regards to data that was
> volatile at fork time. However, we also don't want strange behavior
> on exec if overwritten volatile pages were unexpectedly purged.

I don't know why we should inherit volatility to the child, at least
for anon vranges, because it's not a proper way to share the data.
For sharing anonymous pages, we should use shmem, so that work could
be done when we do the tmpfs work, I guess.

Currently, I implemented it to protect only COW pages.
If the data was purged prior to fork, the page should never be mapped
logically, so the child should see a newly zero-cleared page if it
tries to access the address. But as you pointed out, there is a bug:
I should have handled it in copy_one_pte. I guess the bug might cause
an OOM kill of the parent due to a wrong rss count. I will fix it.

I'm not sure this is a good answer to your question because I couldn't
understand it fully. If my answer isn't enough, could you elaborate
some more?

>
> - At this moment, maybe not having thought it through enough,
> I'm wondering if it makes sense to have volatility inherited on
> fork, but cleared on exec? What are your thoughts here? Its been
> awhile, so I'm not sure if that's consistent with my earlier
> comments on the topic.

I already gave my opinion above.

>
>
> 3) Oddly, in my test case, once I changed the child to over-write
> the volatile range and break the COW pages, the OOM killer more
> frequently seems to favor killing the parent process, instead of the
> memory hogging child process. I need to spend some more time looking
> at this, and I know the OOM killer may go for the parent process
> sometimes, but it definitely happens more frequently then when the
> COW pages are not broken and no data is purged. Again, I need to dig
> in more here.

It should be a problem with the wrong RSS count.
Could you send the test program? I will fix it if you don't have enough time.

>
>
> 4) One of the harder aspects I'm trying to get my head around is how
> your patches seem to use both the page list shrinkers
> (discard_vpage) to purge ranges when particular pages selected, and
> a zone shrinker (discard_vrange_pages) which manages its own lru of
> vranges. I get that this is one way to handle purging anonymous
> pages when we are on a swapless system, but the dual purging systems
> definitely make the code harder to follow. Would something like my

discard_vpage is for avoiding swap-out in the direct reclaim path
when kswapd misses the page.

discard_vrange_pages is for handling volatile pages with top priority,
prior to reclaiming non-volatile pages.

I think it's very clear, NOT hard to understand. :)
And discard_vpage is the basic core function to discard a volatile
page, so it could be used in many places.

> earlier attempts at changing vmscan to shrink anonymous pages be
> simpler? Or is that just not going to fly w/ the mm folks?

There were many attempts in the past. Could you point one out?
>
>
> I'll continue working with the patches and try to get tmpfs support
> added here soon.
>
> Also, attached is a simple cleanup patch that you might want to fold in.

Thanks, John!

>
> thanks
> -john
>

> >From 10f50e53ae706d61591b3247bc494b47a79f2b69 Mon Sep 17 00:00:00 2001
> From: John Stultz <[email protected]>
> Date: Wed, 20 Mar 2013 18:24:56 -0700
> Subject: [PATCH] vrange: Make various vrange.c local functions static
>
> Make a number of local functions in vrange.c static.
>
> Signed-off-by: John Stultz <[email protected]>
> ---
> mm/vrange.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/mm/vrange.c b/mm/vrange.c
> index c0c5d50..d07884d 100644
> --- a/mm/vrange.c
> +++ b/mm/vrange.c
> @@ -45,7 +45,7 @@ static inline void __set_vrange(struct vrange *range,
> range->node.last = end_idx;
> }
>
> -void lru_add_vrange(struct vrange *vrange)
> +static void lru_add_vrange(struct vrange *vrange)
> {
> spin_lock(&lru_lock);
> WARN_ON(!list_empty(&vrange->lru));
> @@ -53,7 +53,7 @@ void lru_add_vrange(struct vrange *vrange)
> spin_unlock(&lru_lock);
> }
>
> -void lru_remove_vrange(struct vrange *vrange)
> +static void lru_remove_vrange(struct vrange *vrange)
> {
> spin_lock(&lru_lock);
> if (!list_empty(&vrange->lru))
> @@ -130,7 +130,7 @@ static inline void range_resize(struct rb_root *root,
> __add_range(range, root, mm);
> }
>
> -int add_vrange(struct mm_struct *mm,
> +static int add_vrange(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> struct rb_root *root;
> @@ -172,7 +172,7 @@ out:
> return 0;
> }
>
> -int remove_vrange(struct mm_struct *mm,
> +static int remove_vrange(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> struct rb_root *root;
> @@ -292,7 +292,7 @@ out:
> return ret;
> }
>
> -bool __vrange_address(struct mm_struct *mm,
> +static bool __vrange_address(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> struct rb_root *root = &mm->v_rb;
> @@ -387,7 +387,7 @@ static void __vrange_purge(struct mm_struct *mm,
> }
> }
>
> -int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
> +static int try_to_discard_one(struct page *page, struct vm_area_struct *vma,
> unsigned long address)
> {
> struct mm_struct *mm = vma->vm_mm;
> @@ -602,7 +602,7 @@ static int vrange_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>
> }
>
> -unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
> +static unsigned int discard_vma_pages(struct zone *zone, struct mm_struct *mm,
> struct vm_area_struct *vma, unsigned long start,
> unsigned long end, unsigned int nr_to_discard)
> {
> @@ -669,7 +669,7 @@ out:
> * Get next victim vrange from LRU and hold a vrange refcount
> * and vrange->mm's refcount.
> */
> -struct vrange *get_victim_vrange(void)
> +static struct vrange *get_victim_vrange(void)
> {
> struct mm_struct *mm;
> struct vrange *vrange = NULL;
> @@ -711,7 +711,7 @@ struct vrange *get_victim_vrange(void)
> return vrange;
> }
>
> -void put_victim_range(struct vrange *vrange)
> +static void put_victim_range(struct vrange *vrange)
> {
> put_vrange(vrange);
> mmdrop(vrange->mm);
> --
> 1.7.10.4
>


--
Kind regards,
Minchan Kim

2013-03-22 17:07:05

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 03/21/2013 11:01 PM, Minchan Kim wrote:
> On Wed, Mar 20, 2013 at 06:29:38PM -0700, John Stultz wrote:
>> On 03/12/2013 12:38 AM, Minchan Kim wrote:
>>> [...]
>>
>> 1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
>> would VRANGE_PARTIAL be useful?
> For exmaple, some process makes 64M vranges and now kernel needs 8M
> pages to flee from memory pressure state. In this case, we don't need
> to discard 64M all at once because if we discard only 8M page, the cost
> of allocator is (8M/4K) * page(falut + allocation + zero-clearing)
> while (64M/4K) * page(falut + allocation + zero-clearing), otherwise.
>
> If it were temporal image extracted on some compressed format, it's not
> easy to regenerate punched hole data from original source so it would
> be better to discard all pages in the vrange, which will be very far
> from memory reclaimer.

So, if I understand you properly, it's more an issue of the added
cost of making the purged range non-volatile, and re-faulting in the
pages if we purge them all, when we didn't actually have the memory
pressure to warrant purging the entire range?

Hrm. Ok, I can sort of see that.

So if we do partial purging, all the data in the range is invalid -
since we don't know which pages in particular were purged - but the
costs of marking the range non-volatile and of over-writing the pages
with the re-created data will be slightly cheaper.

I guess the other benefit is if you're using the SIGBUS semantics, you
might luck out and not actually touch a purged page. Whereas if the
entire range is purged, the process will definitely hit the SIGBUS if
it's accessing the volatile data.


So yea, its starting to make sense.

Much of my earlier confusion comes from the comment in the vrange
syscall implementation that suggests VRANGE_PARTIAL will purge from
ranges intentionally in round-robin order, which I think is probably
not advantageous, as it will invalidate more ranges, causing more
overhead. Instead, using the normal page eviction order with _PARTIAL
would probably be best.


>> 2) I've got a trivial test program that I've used previously with
>> ashmem & my earlier file based efforts that allocates 26megs of page
>> aligned memory, and marks every other meg as volatile. Then it forks
>> and the child generates a ton of memory pressure, causing pages to
>> be purged (and the child killed by the OOM killer). Initially I
>> didn't see my test purging any pages with your patches. The problem
>> of course was the child's COW pages were not also marked volatile,
>> so they could not be purged. Once I over-wrote the data in the
>> child, breaking the COW links, the data in the parent was purged
>> under pressure. This is good, because it makes sure we don't purge
>> cow pages if the volatility state isn't consistent, but it also
>> brings up a few questions:
>>
>> - Should volatility be inherited on fork? If volatility is not
>> inherited on fork(), that could cause some strange behavior if the
>> data was purged prior to the fork, and also its not clear what the
>> behavior of the child should be with regards to data that was
>> volatile at fork time. However, we also don't want strange behavior
>> on exec if overwritten volatile pages were unexpectedly purged.
> I don't know why we should inherit volatility to child at least, for
> anon vrange. Because it's not proper way to share the data.
> For data sharing for anonymous page, we should use shmem so the work
> could be done when we work tmpfs work, I guess.

I'm not suggesting the volatile *pages* on fork would be shared (other
than that they are COW); instead, my point is that the volatile *state*
of the pages should probably be preserved over a fork.

Given the following example:

buf = malloc(BIGBUF);
memset(buf, 'P', BIGBUF);
vrange(buf, BIGBUF, VRANGE_VOLATILE, VRANGE_FULL);
pid = fork();

if (!pid) /* break COW sharing*/
memset(buf, 'C', BIGBUF);

generate_memory_pressure();
purged = vrange(buf, BIGBUF, VRANGE_NOVOLATILE, VRANGE_FULL);


Currently, because vrange is set before the fork, in this example, only
the parent's volatile range will be purged. However, if we were to move
the fork() one line up, then both parent and child would see their
separate ranges purged. This behavior is not quite intuitive, as I
usually expect the child's state to be identical to the parent's at fork time.

In my mind, what would be less surprising is if, in the above code,
the volatility state of buf were inherited by the child as well
(basically copying the vrange tree at fork).

And the COW breaking in the above is just for clarification; even if
the COW links weren't broken and the pages were still shared between
the child and parent after fork, since they both would consider the
buffer state volatile, it would still be ok to purge the pages.

Now, the other side of the coin is that if we have volatile data at
fork time, but the child then goes on to call exec, we don't want the
new process to randomly hit SIGBUS faults when the earlier-set volatile
range is purged. So if we inherit volatile state to children, my thought
is we probably want to clear all volatile state on exec.
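
Roughly, a sketch of what the fork-time copy could look like, reusing
the helpers from the patchset with a hypothetical dup_vrange() called
from dup_mmap() (error handling kept minimal):

static int dup_vrange(struct mm_struct *oldmm, struct mm_struct *mm)
{
	struct rb_node *next;

	vrange_lock(oldmm);
	for (next = rb_first(&oldmm->v_rb); next; next = rb_next(next)) {
		struct vrange *range = vrange_entry(next);
		struct vrange *new = alloc_vrange();

		if (!new) {
			/* caller would unwind via exit_vrange(mm) */
			vrange_unlock(oldmm);
			return -ENOMEM;
		}
		/* copy interval and purged state into the child's tree */
		__set_vrange(new, range->node.start, range->node.last);
		new->purged = range->purged;
		__add_range(new, &mm->v_rb);
	}
	vrange_unlock(oldmm);
	return 0;
}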




>>
>> 4) One of the harder aspects I'm trying to get my head around is how
>> your patches seem to use both the page list shrinkers
>> (discard_vpage) to purge ranges when particular pages selected, and
>> a zone shrinker (discard_vrange_pages) which manages its own lru of
>> vranges. I get that this is one way to handle purging anonymous
>> pages when we are on a swapless system, but the dual purging systems
>> definitely make the code harder to follow. Would something like my
> discard_vpage is for avoiding swapping out in direct reclaim path
> when kswapd miss the page.
>
> discard_vrange_pages is for handling volatile pages as top prioirty
> prio to reclaim non-volatile pages.

So one note: while I've pushed for freeing volatile pages first in the
past, I know Mel has had some objections to this. For instance, he
thought there are cases where freeing the volatile data first wasn't the
right thing to do, such as the case of streaming data, and that we
probably want to leave it to the page eviction LRUs to pick the pages
for us.


>
> I think it's very clear, NOT to understand. :)
> And discard_vpage is basic core function to discard volatile page
> so it could be used many places.

Ok, I suspect it will make more sense as I get more familiar with it. :)

>
>> earlier attempts at changing vmscan to shrink anonymous pages be
>> simpler? Or is that just not going to fly w/ the mm folks?
> There were many attempt at old. Could you point out?

https://lkml.org/lkml/2012/6/12/587
Although I know you had objections to my specific implementation, since
it kept non-volatile anonymous pages on the active list.



thanks
-john

2013-03-25 08:42:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Fri, Mar 22, 2013 at 10:06:56AM -0700, John Stultz wrote:
> On 03/21/2013 11:01 PM, Minchan Kim wrote:
> >On Wed, Mar 20, 2013 at 06:29:38PM -0700, John Stultz wrote:
> >>On 03/12/2013 12:38 AM, Minchan Kim wrote:
> >>> [...]
> >>
> >>1) So, I'm not sure I understand the benefit of VRANGE_PARTIAL. Why
> >>would VRANGE_PARTIAL be useful?
> >For exmaple, some process makes 64M vranges and now kernel needs 8M
> >pages to flee from memory pressure state. In this case, we don't need
> >to discard 64M all at once because if we discard only 8M page, the cost
> >of allocator is (8M/4K) * page(falut + allocation + zero-clearing)
> >while (64M/4K) * page(falut + allocation + zero-clearing), otherwise.
> >
> >If it were temporal image extracted on some compressed format, it's not
> >easy to regenerate punched hole data from original source so it would
> >be better to discard all pages in the vrange, which will be very far
> >from memory reclaimer.
>
> So, if I understand you properly, its more an issue of the the added
> cost of making the purged range non-volatile, and re-faulting in the
> pages if we purge them all, when we didn't actually have the memory
> pressure to warrant purging the entire range?
>
> Hrm. Ok, I can sort of see that.
>
> So if we do partial-purging, all the data in the range is invalid -
> since we don't know which pages in particular were purged, but the
> costs when marking the range non-volatile and the costs of
> over-writing the pages with the re-created data will be slightly
> cheaper.

It could be much cheaper: in my experiment with this patchset, the
allocator reduced minor faults from 105799867 to 9401.

>
> I guess the other benefit is if you're using the SIGBUS semantics,
> you might luck out and not actually touch a purged page. Where as if
> the entire range is purged, the process will definitely hit the
> SIGBUS if its accessing the volatile data.

Yes. I guess that's why Taras liked it.
Quoting from an old version:
"
4) Having a new system call makes it easier for userspace apps to
detect kernels without this functionality.

I really like the proposed interface. I like the suggestion of having
explicit FULL|PARTIAL_VOLATILE. Why not include PARTIAL_VOLATILE as a
required 3rd param in first version with expectation that
FULL_VOLATILE will be added later(and returning some not-supported error
in meantime)?
"

>
>
> So yea, its starting to make sense.
>
> Much of my earlier confusion comes from comment in the vrange
> syscall implementation that suggests VRANGE_PARTIAL will purge from
> ranges intentionally in round-robin order, which I think is probably
> not advantageous, as it will invalidate more ranges causing more
> overhead. Instead using the normal page eviction order with
> _PARTIAL would probably be best.

As you know, I insisted several times that "volatile pages" are nothing
special, so we should reclaim them in normal page order. But I changed
my mind: from the allocator's POV, if we reclaim in normal page order,
the VM can swap out other working-set pages instead of volatile pages.
What would happen without this feature in the kernel? Allocators would
call madvise(DONTNEED) or munmap, so those pages would never be swapped
out (whereas a swapped-out working-set page means a major fault later).

Major faults would erode this feature's benefit, so I'd like to sweep
volatile pages out first.

If we really want to reclaim some vrange pages in normal page order,
we can add a new argument to the vrange system call and handle it later.
But I'm not sure we really need it.

>
>
> >>2) I've got a trivial test program that I've used previously with
> >>ashmem & my earlier file based efforts that allocates 26megs of page
> >>aligned memory, and marks every other meg as volatile. Then it forks
> >>and the child generates a ton of memory pressure, causing pages to
> >>be purged (and the child killed by the OOM killer). Initially I
> >>didn't see my test purging any pages with your patches. The problem
> >>of course was the child's COW pages were not also marked volatile,
> >>so they could not be purged. Once I over-wrote the data in the
> >>child, breaking the COW links, the data in the parent was purged
> >>under pressure. This is good, because it makes sure we don't purge
> >>cow pages if the volatility state isn't consistent, but it also
> >>brings up a few questions:
> >>
> >> - Should volatility be inherited on fork? If volatility is not
> >>inherited on fork(), that could cause some strange behavior if the
> >>data was purged prior to the fork, and also its not clear what the
> >>behavior of the child should be with regards to data that was
> >>volatile at fork time. However, we also don't want strange behavior
> >>on exec if overwritten volatile pages were unexpectedly purged.
> >I don't know why we should inherit volatility to child at least, for
> >anon vrange. Because it's not proper way to share the data.
> >For data sharing for anonymous page, we should use shmem so the work
> >could be done when we work tmpfs work, I guess.
>
> I'm not suggesting the volatile *pages* on fork would be shared
> (other then they are COW), instead my point is the volatile *state*
> of the pages should probably be preserved over a fork.
>
> Given the following example:
>
> buf = malloc(BIGBUF);
> memset(buf, 'P', BIGBUF);
> vrange(buf, BIGBUF, VRANGE_VOLATILE, VRANGE_FULL);
> pid = fork();
>
> if (!pid) /* break COW sharing*/
> memset(buf, 'C', BIGBUF);
>
> generate_memory_pressure();
> purged = vrange(buf, BIGBUF, VRANGE_NOVOLATILE, VRANGE_FULL);
>
>
> Currently, because vrange is set before the fork, in this example,
> only the parent's volatile range will be purged. However, if we were
> to move the fork() one line up, then both parent and child would see
> their separate ranges purged. This behavior is not quite intuitive,
> as I usually expect the childs state to be identical to the parents
> at fork time.
>
> In my mind, what would be less surprising is if in the above code,
> the volatility state of buf would be inherited to the child as well
> (basically copying the vrange tree at fork).
>
> And the cow breaking in the above is just for clarification, even if
> the COW links weren't broken and the pages were still shared between
> the child and parent after fork, since they both would consider the
> buffer state as volatile, it would still be ok to purge the pages.
>
> Now, the other side of the coin, is that if we have volatile data at
> fork time, but the child then goes on to call exec, we don't want
> the new process to randomly hit sigfaults when the earlier set
> volatile range is purged. So if we inherit volatile state to
> children, my thought is we probably want to clear all volatile state
> on exec.

Indeed, I get your point. Frankly speaking, I implemented it with an
odd locking scheme in this version, then decided to drop it because I
wasn't sure anyone wanted such a usecase between parent and child, so
I was tempted to drop it. ;-)

Okay, I have an idea. I will support it in the next spin, and I don't
think it's odd at all that a child has volatile data at fork time and
that exec clears it.
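
John's example above, filled out as a complete program, would look
roughly like the sketch below. Treat it as a sketch only: the wrapper
and the constant values are placeholder assumptions, with just the
syscall number 314 taken from the prototype's x86-64 wiring.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define BIGBUF (16 * 1024 * 1024)

/* Placeholder values: the real constants come from the patchset. */
#define VRANGE_VOLATILE 0
#define VRANGE_NOVOLATILE 1
#define VRANGE_FULL 0

static long vrange(void *addr, size_t len, int mode, int behavior)
{
	return syscall(314, addr, len, mode, behavior);
}

int main(void)
{
	char *buf;
	long purged;
	pid_t pid;

	/* vrange presumably works on whole pages, so align the buffer. */
	if (posix_memalign((void **)&buf, 4096, BIGBUF))
		return 1;

	memset(buf, 'P', BIGBUF);
	vrange(buf, BIGBUF, VRANGE_VOLATILE, VRANGE_FULL);
	pid = fork();

	if (!pid) /* break COW sharing */
		memset(buf, 'C', BIGBUF);

	/* ... generate memory pressure here ... */

	purged = vrange(buf, BIGBUF, VRANGE_NOVOLATILE, VRANGE_FULL);
	printf("pid %d: purged=%ld\n", getpid(), purged);
	return 0;
}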

>
>
>
>
> >>
> >>4) One of the harder aspects I'm trying to get my head around is how
> >>your patches seem to use both the page list shrinkers
> >>(discard_vpage) to purge ranges when particular pages selected, and
> >>a zone shrinker (discard_vrange_pages) which manages its own lru of
> >>vranges. I get that this is one way to handle purging anonymous
> >>pages when we are on a swapless system, but the dual purging systems
> >>definitely make the code harder to follow. Would something like my
> >discard_vpage is for avoiding swapping out in direct reclaim path
> >when kswapd miss the page.
> >
> >discard_vrange_pages is for handling volatile pages as top prioirty
> >prio to reclaim non-volatile pages.
>
> So one note: while I've pushed for freeing volatile pages first in
> the past, I know Mel has had some objections to this, for instance,

Me too, at that time. But I changed my mind, as I mentioned earlier.

> he thought there are cases where freeing the volatile data first
> wasn't the right thing to do, such as the case with streaming data,
> and that we probably want to leave it to the page eviction LRUs to
> pick the pages for us.

I agree on streaming data. It would be great to reclaim those rather
than vrange pages, but the current VM isn't smart enough to detect
streaming data and reclaim it as top priority without the user's help.

If the user helps the kernel with a hint (e.g. fadvise), the kernel
frees the pages instantly so nothing remains in the LRU, and if the
kernel can't free pages for some reason (dirty or locked), it can move
them to the tail of the inactive LRU so that on the next pass it can
reclaim them as top priority once they meet the free condition.

It means what we have to care about is streaming data remaining in the
LRU after the user has given such advice. If that turns out to be a
really severe problem, I'd like to introduce a new LRU list, call it
easy-reclaimable, so the kernel can move such pages onto it and reclaim
them as top priority before discarding vrange pages.
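
From userspace, the existing hint is just the fadvise call below (a
real, current API, shown for contrast with vrange's lazy semantics):

#include <fcntl.h>

/* Hint that the page cache backing [off, off + len) of fd won't be
 * reused soon; the kernel may drop clean cached pages right away. */
static int drop_cached(int fd, off_t off, off_t len)
{
	return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}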

>
>
> >
> >I think it's very clear, NOT to understand. :)
> >And discard_vpage is basic core function to discard volatile page
> >so it could be used many places.
>
> Ok, I suspect it will make more sense as I get more familiar with it. :)
>
> >
> >>earlier attempts at changing vmscan to shrink anonymous pages be
> >>simpler? Or is that just not going to fly w/ the mm folks?
> >There were many attempt at old. Could you point out?
>
> https://lkml.org/lkml/2012/6/12/587
> Although I know you had objections to my specific implementation,
> since it kept non-volatile anonymous pages on the active list.

It breaks my goal that "hint system calls should be cheap", because we
would need to move pages from the active list to the inactive list in
the volatile system call context, and that would never be cheap.

>
>
>
> thanks
> -john

--
Kind regards,
Minchan Kim

2013-03-25 18:16:16

by Bartlomiej Zolnierkiewicz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page


Hi,

On Tuesday 12 March 2013 08:38:24 Minchan Kim wrote:
> First of all, let's define the term.
> From now on, I'd like to call it as vrange(a.k.a volatile range)
> for anonymous page. If you have a better name in mind, please suggest.
>
> This version is still *RFC* because it's just quick prototype so
> it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> Before further sorting out issues, I'd like to post current direction
> and discuss it. Of course, I'd like to extend this discussion in
> comming LSF/MM.
>
> In this version, I changed lots of thing, expecially removed vma-based
> approach because it needs write-side lock for mmap_sem, which will drop
> performance in mutli-threaded big SMP system, KOSAKI pointed out.
> And vma-based approach is hard to meet requirement of new system call by
> John Stultz's suggested semantic for consistent purged handling.
> (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
>
> I tested this patchset with modified jemalloc allocator which was
> leaded by Jason Evans(jemalloc author) who was interest in this feature
> and was happy to port his allocator to use new system call.
> Super Thanks Jason!
>
> The benchmark for test is ebizzy. It have been used for testing the
> allocator performance so it's good for me. Again, thanks for recommending
> the benchmark, Jason.
> (http://people.freebsd.org/~kris/scaling/ebizzy.html)
>
> The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
>
> ebizzy -S 20
>
> jemalloc-vanilla: 52389 records/sec
> jemalloc-vrange: 203414 records/sec
>
> ebizzy -S 20 with background memory pressure
>
> jemalloc-vanilla: 40746 records/sec
> jemalloc-vrange: 174910 records/sec

Could you please make the modified jemalloc/ebizzy available somewhere so
there is an easy way to test your patchset?

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung Poland R&D Center

2013-03-27 00:32:24

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 03/25/2013 01:42 AM, Minchan Kim wrote:
> On Fri, Mar 22, 2013 at 10:06:56AM -0700, John Stultz wrote:
>> So, if I understand you properly, its more an issue of the the added
>> cost of making the purged range non-volatile, and re-faulting in the
>> pages if we purge them all, when we didn't actually have the memory
>> pressure to warrant purging the entire range? Hrm. Ok, I can sort of
>> see that. So if we do partial-purging, all the data in the range is
>> invalid - since we don't know which pages in particular were purged,
>> but the costs when marking the range non-volatile and the costs of
>> over-writing the pages with the re-created data will be slightly
>> cheaper.
> It could be heavily cheaper with my experiment in this patchset.
> Allocator could avoid minor fault from 105799867 to 9401.
>
>> I guess the other benefit is if you're using the SIGBUS semantics,
>> you might luck out and not actually touch a purged page. Where as if
>> the entire range is purged, the process will definitely hit the
>> SIGBUS if its accessing the volatile data.
> Yes. I guess that's why Taras liked it.
> Quote from old version
> "
> 4) Having a new system call makes it easier for userspace apps to
> detect kernels without this functionality.
>
> I really like the proposed interface. I like the suggestion of having
> explicit FULL|PARTIAL_VOLATILE. Why not include PARTIAL_VOLATILE as a
> required 3rd param in first version with expectation that
> FULL_VOLATILE will be added later(and returning some not-supported error
> in meantime)?
> "

Thanks again for the clarifications on your thought process here!

I'm currently trying to rework your patches so we can reuse this for
file data as well as pure anonymous memory. The idea being that we add
one level of indirection: a vrange_root structure, which manages the
root of the rb interval tree as well as the lock. This vrange_root can
then be included in the mm_struct as well as address_space structures
depending on which type of memory we're dealing with. That way most of
the same infrastructure can be used to manage per-mm volatile ranges as
well as per-inode volatile ranges.
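
In outline, that indirection might look like the sketch below; the
field and type names are illustrative guesses, not lifted from the
actual branch:

/* One tree of volatile ranges plus the lock guarding it,
 * embeddable wherever ranges need tracking. */
struct vrange_root {
	struct rb_root v_rb;	/* interval tree of vranges */
	struct mutex v_lock;	/* protects v_rb */
};

/* mm_struct (anonymous memory) and address_space (file memory)
 * would then each embed one: */
struct mm_struct {
	/* ... existing fields ... */
	struct vrange_root vroot;	/* per-mm volatile ranges */
};

struct address_space {
	/* ... existing fields ... */
	struct vrange_root vroot;	/* per-inode volatile ranges */
};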

Sorting out how to handle vrange() calls that cross both anonymous and
file vmas will be interesting, and may have some of the drawbacks of the
vma based approach, but I think it will still be simpler. To start we
may just be able to require that any vrange() calls don't cross vma
types (possibly using separate syscalls for file and anonymous vranges).

Anyway, that's my current thinking. You can preview my current attempt here:
http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/vrange-minchan

Thanks so much again for moving this work forward!
-john

2013-03-27 07:18:25

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

Hi Bart,

On Mon, Mar 25, 2013 at 06:16:16PM +0100, Bartlomiej Zolnierkiewicz wrote:
>
> Hi,
>
> On Tuesday 12 March 2013 08:38:24 Minchan Kim wrote:
> > First of all, let's define the term.
> > From now on, I'd like to call it as vrange(a.k.a volatile range)
> > for anonymous page. If you have a better name in mind, please suggest.
> >
> > This version is still *RFC* because it's just quick prototype so
> > it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> > Before further sorting out issues, I'd like to post current direction
> > and discuss it. Of course, I'd like to extend this discussion in
> > comming LSF/MM.
> >
> > In this version, I changed lots of thing, expecially removed vma-based
> > approach because it needs write-side lock for mmap_sem, which will drop
> > performance in mutli-threaded big SMP system, KOSAKI pointed out.
> > And vma-based approach is hard to meet requirement of new system call by
> > John Stultz's suggested semantic for consistent purged handling.
> > (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
> >
> > I tested this patchset with modified jemalloc allocator which was
> > leaded by Jason Evans(jemalloc author) who was interest in this feature
> > and was happy to port his allocator to use new system call.
> > Super Thanks Jason!
> >
> > The benchmark for test is ebizzy. It have been used for testing the
> > allocator performance so it's good for me. Again, thanks for recommending
> > the benchmark, Jason.
> > (http://people.freebsd.org/~kris/scaling/ebizzy.html)
> >
> > The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
> >
> > ebizzy -S 20
> >
> > jemalloc-vanilla: 52389 records/sec
> > jemalloc-vrange: 203414 records/sec
> >
> > ebizzy -S 20 with background memory pressure
> >
> > jemalloc-vanilla: 40746 records/sec
> > jemalloc-vrange: 174910 records/sec
>
> Could you please make the modified jemalloc/ebizzy available somewhere so
> there is a easy way to test your patchset?

I will try to do that in the next spin.
Thanks for your interest!

--
Kind regards,
Minchan Kim

2013-03-27 08:03:38

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Tue, Mar 26, 2013 at 05:26:04PM -0700, John Stultz wrote:
> On 03/25/2013 01:42 AM, Minchan Kim wrote:
> >On Fri, Mar 22, 2013 at 10:06:56AM -0700, John Stultz wrote:
> >>So, if I understand you properly, its more an issue of the the
> >>added cost of making the purged range non-volatile, and
> >>re-faulting in the pages if we purge them all, when we didn't
> >>actually have the memory pressure to warrant purging the entire
> >>range? Hrm. Ok, I can sort of see that. So if we do
> >>partial-purging, all the data in the range is invalid - since we
> >>don't know which pages in particular were purged, but the costs
> >>when marking the range non-volatile and the costs of
> >>over-writing the pages with the re-created data will be slightly
> >>cheaper.
> >It could be heavily cheaper with my experiment in this patchset.
> >Allocator could avoid minor fault from 105799867 to 9401.
> >
> >>I guess the other benefit is if you're using the SIGBUS semantics,
> >>you might luck out and not actually touch a purged page. Where as if
> >>the entire range is purged, the process will definitely hit the
> >>SIGBUS if its accessing the volatile data.
> >Yes. I guess that's why Taras liked it.
> >Quote from old version
> >"
> >4) Having a new system call makes it easier for userspace apps to
> > detect kernels without this functionality.
> >
> >I really like the proposed interface. I like the suggestion of having
> >explicit FULL|PARTIAL_VOLATILE. Why not include PARTIAL_VOLATILE as a
> >required 3rd param in first version with expectation that
> >FULL_VOLATILE will be added later(and returning some not-supported error
> >in meantime)?
> >"
>
> Thanks again for the clarifications on your though process here!
>
> I'm currently trying to rework your patches so we can reuse this for
> file data as well as pure anonymous memory. The idea being that we
> add one level of indirection: a vrange_root structure, which manages
> the root of the rb interval tree as well as the lock. This
> vrange_root can then be included in the mm_struct as well as
> address_space structures depending on which type of memory we're
> dealing with. That way most of the same infrastructure can be used
> to manage per-mm volatile ranges as well as per-inode volatile
> ranges.

Yeb.

>
> Sorting out how to handle vrange() calls that cross both anonymous
> and file vmas will be interesting, and may have some of the
> drawbacks of the vma based approach, but I think it will still be

Do you have any specific drawback examples?
I'd like to solve it if it is critical, and I believe we shouldn't
accept it just for the sake of a simpler implementation.

> simpler. To start we may just be able to require that any vrange()
> calls don't cross vma types (possibly using separate syscalls for
> file and anonymous vranges).

I can't quite parse what problem concerns you here.
Why should we have a separate syscall?

>
> Anyway, that's my current thinkig. You can preview my current attempt here:
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/vrange-minchan
>

I looked it over roughly and it seems good to me.
I will review it in detail when you send a formal patch. :)

Off-topic:

Let's think about another vrange usecase, for file pages.
I'm now thinking it might be useful as a hint interface for the kernel.
As you know, we already have hint interfaces, madvise and fadvise.
But they are always heavy, because the kernel has to spend time handling
every page of the range, so the cost grows linearly with the range's
size. Another problem is that they don't take a system-wide view; for
one example, look at
http://permalink.gmane.org/gmane.linux.kernel.mm/95424
There were several similar attempts long ago, but they were rejected
because they could change current behavior if the system call moved
pages onto the inactive list without freeing them instantly.

Andrew also suggested creating a new advice value rather than replacing
an old one, for compatibility.

So what I want is a new interface as a totally different system call,
vrange, because I believe a hint system call should be very cheap so
that many users can call it frequently. If users don't use it often,
the kernel doesn't get any benefit, either.

The vrange system call can be cheap because we move the hot-path
overhead to the slow path (the reclaim path). And since we define new
behavior for vrange, we can implement new ideas freely.

I think it's good for system memory handling.
As you know well, there have been several attempts to handle memory
management in userspace. One example is the low-memory notifier: the
kernel just sends a signal and the user can free pages. Frankly
speaking, I don't like that idea, because several factors limit how
quickly a userspace daemon can react in bounded time, and it can raise
false-positive alarms if the system has streaming data, mlocked pages,
many dirty pages, and so on.

Anyway, my point is that I'd like page reclaiming to be controlled by
the kernel alone. For that, userspace can register its volatile or
reclaimable memory ranges with the kernel and define a threshold.
If the kernel finds memory is below the user-defined threshold, it can
reclaim every page in the registered ranges freely.

It means the kernel keeps ownership of page freeing, which makes the
system more deterministic and less out of control.

So the vrange system call's semantics would be as follows.

1. vrange for anonymous pages -> discard without swapout
2. vrange for file-backed pages except shmem/tmpfs -> discard without sync
3. vrange for shmem/tmpfs -> hole punching

It's just my two cents. ;-)
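
For comparison, case 3 is semantically close to the hole punch
userspace can already request on tmpfs today, except that vrange would
defer the punch until memory pressure actually arrives:

#define _GNU_SOURCE
#include <fcntl.h>

/* Immediately free the backing pages of [off, off + len); later
 * reads of the hole return zeroes. vrange would do this lazily. */
static int punch_hole(int fd, off_t off, off_t len)
{
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, len);
}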

> Thanks so much again for your moving this work forward!

Thanks for your collaboration!

--
Kind regards,
Minchan Kim

2013-03-30 00:05:23

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 03/27/2013 01:03 AM, Minchan Kim wrote:
> On Tue, Mar 26, 2013 at 05:26:04PM -0700, John Stultz wrote:
>> Sorting out how to handle vrange() calls that cross both anonymous
>> and file vmas will be interesting, and may have some of the
>> drawbacks of the vma based approach, but I think it will still be
> Do you have any specific drawback examples?
> I'd like to solve it if it is critical and I believe we shouldn't
> do that for simpler implementation.

My current thought is that we manage volatile memory on both a per-mm
(for anonymous memory) and per-address_space (for file memory) basis.

The downside: if we manage both file and anonymous volatile ranges with
the same interface, we may have problems similar to the per-vma approach
you were trying before. Specifically, if a single range covers both
anonymous and file memory, we'll have to iterate over the different
types of ranges, as we did with your earlier vma approach.

This adds some complexity, since with the single-interval-tree method in
your current patch we know we only have to allocate one additional
range per insert/remove. So we can do that right off the bat, and return
any ENOMEM error without having made any state changes. This is a nice
quality to have.
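
That is, allocate the one node you might need before touching the
tree, so -ENOMEM can be returned with nothing changed. A sketch with
made-up names:

/* Illustrative only: none of these names come from the patch. */
int vrange_mark(struct vrange_root *root, unsigned long start,
		unsigned long end)
{
	struct vrange *node = vrange_alloc();	/* the one extra range */

	if (!node)
		return -ENOMEM;	/* fail before any state change */

	mutex_lock(&root->v_lock);
	/* Merge/split against the interval tree; at most one new node
	 * is ever needed, and it is already in hand. */
	__vrange_add(root, node, start, end);
	mutex_unlock(&root->v_lock);
	return 0;
}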

Whereas if we're iterating over different types of ranges, with
possibly multiple trees (ie: different mmapped files), we don't know how
many new ranges we may have to allocate, so we could fail halfway
through, which gives ambiguous results when marking ranges non-volatile
(returning the error leaves the range possibly half-unmarked).


I'm still thinking it through, but that's my concern.

Some ways we can avoid this:
1) Require that any vrange() call not cross different types of memory.
2) Provide a different vrange call (fvrange?) to be used with file-backed
memory.

Any other thoughts?


>> Anyway, that's my current thinkig. You can preview my current attempt here:
>> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/vrange-minchan
>>
> I saw it roughly and it seems good to me.
> I will review it in detail if you send formal patch. :)
Ok. I'm still working on some changes (been slow this week), but hope to
have more to send your way next week.

> As you know well, there are several trial to handle memory management
> in userspace. One of example is lowmemory notifier. Kernel just send
> signal and user can free pages. Frankly speaking, I don't like that idea.
> Because there are several factors to limit userspace daemon's bounded
> reaction and could have false-positive alarm if system has streaming data,
> mlocked pages or many dirty pages and so on.

True. However, I think there are valid use cases for low-memory
notification (Android's low-memory killer is one, where we're not just
freeing pages but killing processes), and I think both approaches have
valid uses.

> Anyway, my point is that I'd like to control page reclaiming in only
> kernel itself. For it, userspace can register their volatile or
> reclaimable memory ranges to kernel and define to the threshold.
> If kernel find memory is below threshold user defined, kernel can
> reclaim every pages in registered range freely.
>
> It means kernel has a ownership of page freeing. It makes system more
> deterministic and not out-of-control.
>
> So vrange system call's semantic is following as.
>
> 1. vrange for anonymous page -> Discard wthout swapout
> 2. vrange for file-backed page except shmem/tmpfs -> Discard without sync
> 3. vrange for shmem/tmpfs -> hole punching
I think that for non-shmem file-backed pages (case #2) hole punching
will be required as well. Though I'm not totally convinced volatile
ranges on non-tmpfs files actually make sense (I have yet to understand
a use case).


Thanks again for your thoughts here.

thanks
-john

2013-04-01 07:57:57

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Fri, Mar 29, 2013 at 05:05:17PM -0700, John Stultz wrote:
> On 03/27/2013 01:03 AM, Minchan Kim wrote:
> >On Tue, Mar 26, 2013 at 05:26:04PM -0700, John Stultz wrote:
> >>Sorting out how to handle vrange() calls that cross both anonymous
> >>and file vmas will be interesting, and may have some of the
> >>drawbacks of the vma based approach, but I think it will still be
> >Do you have any specific drawback examples?
> >I'd like to solve it if it is critical and I believe we shouldn't
> >do that for simpler implementation.
>
> My current thought is that we manage volatile memory on both a
> per-mm (for anonymous memory) and per-address_space (for file
> memory) basis.

First of all, I have a dumb question. I haven't thought about the tmpfs
usecase as deeply as you have, so I hope this stupid question opens my
eyes.

I thought of it like this:

1. The vrange system call doesn't care whether the range is anonymous or not.
2. The discarder (at the moment a kswapd hook, or kvranged in the future, plus
direct page reclaim) can work out whether a vma is anonymous or file-backed.
3. If a vma in a vrange is anonymous, it can discard a page rather than
swapping it out.
4. If a vma in a vrange is file-backed (ie, tmpfs), it can discard a page
rather than swapping it out => the same effect as a hole punch.

Both 3 and 4 would be handled via rmap, so a page couldn't be discarded if
anyone still maps it as non-volatile.

In this scenario, I can't see what role the per-address_space tree
plays, so my question is why we need per-address_space vrange
management.

If I read your mind correctly, are you considering an fd-based system
call, rather than an mmaped-address-space approach, to replace ashmem?

>
> The down side, if we manage both file and anonymous volatile ranges
> with the same interface, we may have similar problems to the per-vma
> approach you were trying before. Specifically, if a single range
> covers both anonymous and file memory, we'll have to do a similar
> iterating over the different types of ranges, as we did with your
> earlier vma approach.

As I said earlier, I don't want to care whether a new range is anonymous
or file-backed in the vrange system call context. It's just a vrange,
which can then be handled properly later when memory pressure happens.

>
> This adds some complexity since with the single interval tree method
> in your current patch, we know we only have to allocate one
> additional range per insert/remove. So we can do that right off the
> bat, and return any enomem errors without having made any state
> changes. This is a nice quality to have.
>
> Where as if we're iterating over different types of ranges, with
> possibly multiple trees (ie: different mmapped files), we don't know
> how many new ranges we may have to allocate, so we could fail half
> way which causes ambiguous results on the marking ranges
> non-volatile (since returning the error leaves the range possibly
> half-unmarked).

Maybe I can understand your point after seeing your concern with a more
concrete example. :)

>
>
> I'm still thinking it through, but that's my concern.
>
> Some ways we can avoid this:
> 1) Require that any vrange() call not cross different types of memory.
> 2) Provide a different vrange call (fvrange?)to be used with file
> backed memory.
>
> Any other thoughts?
>
>
> >>Anyway, that's my current thinkig. You can preview my current attempt here:
> >>http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/vrange-minchan
> >>
> >I saw it roughly and it seems good to me.
> >I will review it in detail if you send formal patch. :)
> Ok. I'm still working on some changes (been slow this week), but
> hope to have more to send your way next week.
>
> >As you know well, there are several trial to handle memory management
> >in userspace. One of example is lowmemory notifier. Kernel just send
> >signal and user can free pages. Frankly speaking, I don't like that idea.
> >Because there are several factors to limit userspace daemon's bounded
> >reaction and could have false-positive alarm if system has streaming data,
> >mlocked pages or many dirty pages and so on.
>
> True. However, I think that there are valid use cases lowmemory
> notification (Android's low-memory killer is one, where we're not
> just freeing pages, but killing processes), and I think both
> approaches have valid use.

Yeb. I didn't say a userspace memory notifier is useless.
My point is that a userspace memory notifier can be fragile for several
reasons: it works well while the system still has some free memory, but
it doesn't work well under heavy memory pressure. In that case it would
be better for the kernel to reclaim pages itself rather than depending
on the user's response.

Yeb, of course, there is a trade-off there.
The kernel doesn't have as much knowledge as userspace, so it may
reclaim sub-optimal pages. So my suggestion is that the platform can
use the low-memory notifier while memory pressure is mild, but when
memory pressure approaches OOM, the kernel should reclaim instantly;
beforehand, the user can give the kernel a hint about which ranges of
the address space are recoverable. That could be another usecase for
vrange.

>
> >Anyway, my point is that I'd like to control page reclaiming in only
> >kernel itself. For it, userspace can register their volatile or
> >reclaimable memory ranges to kernel and define to the threshold.
> >If kernel find memory is below threshold user defined, kernel can
> >reclaim every pages in registered range freely.
> >
> >It means kernel has a ownership of page freeing. It makes system more
> >deterministic and not out-of-control.
> >
> >So vrange system call's semantic is following as.
> >
> >1. vrange for anonymous page -> Discard wthout swapout
> >2. vrange for file-backed page except shmem/tmpfs -> Discard without sync
> >3. vrange for shmem/tmpfs -> hole punching
> I think on non-shmem file backed pages (case #2) hole punching will
> be required as well. Though I'm not totally convinced volatile

What I mean is: let's use the vrange system call instead of
madvise/fadvise, so file-backed pages except shmem/tmpfs wouldn't be
discarded, just reclaimed.

> ranges on non-tmpfs files actually makes sense (I still have yet to
> understand a use case).

1. fadvise/madvise reclaim pages instantly even if there is no memory pressure.
2. They take a per-process view, not a system-wide one.
3. The system call cost grows with the size of the range.

vrange could fix the above problems.

I'm not serious about this usecase; I just raised it out of curiosity
about how others think about it.

--
Kind regards,
Minchan Kim

2013-04-10 20:23:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

(3/12/13 3:38 AM), Minchan Kim wrote:
> First of all, let's define the term.
> From now on, I'd like to call it as vrange(a.k.a volatile range)
> for anonymous page. If you have a better name in mind, please suggest.
>
> This version is still *RFC* because it's just quick prototype so
> it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> Before further sorting out issues, I'd like to post current direction
> and discuss it. Of course, I'd like to extend this discussion in
> comming LSF/MM.
>
> In this version, I changed lots of thing, expecially removed vma-based
> approach because it needs write-side lock for mmap_sem, which will drop
> performance in mutli-threaded big SMP system, KOSAKI pointed out.
> And vma-based approach is hard to meet requirement of new system call by
> John Stultz's suggested semantic for consistent purged handling.
> (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
>
> I tested this patchset with modified jemalloc allocator which was
> leaded by Jason Evans(jemalloc author) who was interest in this feature
> and was happy to port his allocator to use new system call.
> Super Thanks Jason!
>
> The benchmark for test is ebizzy. It have been used for testing the
> allocator performance so it's good for me. Again, thanks for recommending
> the benchmark, Jason.
> (http://people.freebsd.org/~kris/scaling/ebizzy.html)
>
> The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
>
> ebizzy -S 20
>
> jemalloc-vanilla: 52389 records/sec
> jemalloc-vrange: 203414 records/sec
>
> ebizzy -S 20 with background memory pressure
>
> jemalloc-vanilla: 40746 records/sec
> jemalloc-vrange: 174910 records/sec
>
> And it's much improved on KVM virtual machine.
>
> This patchset is based on v3.9-rc2
>
> - What's the sys_vrange(addr, length, mode, behavior)?
>
> It's a hint that user deliver to kernel so kernel can *discard*
> pages in a range anytime. mode is one of VRANGE_VOLATILE and
> VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is memory pin operation so
> kernel coudn't discard any pages any more while VRANGE_VOLATILE
> is memory unpin opeartion so kernel can discard pages in vrange
> anytime. At a moment, behavior is one of VRANGE_FULL and VRANGE
> PARTIAL. VRANGE_FULL tell kernel that once kernel decide to
> discard page in a vrange, please, discard all of pages in a
> vrange selected by victim vrange. VRANGE_PARTIAL tell kernel
> that please discard of some pages in a vrange. But now I didn't
> implemented VRANGE_PARTIAL handling yet.
>
> - What happens if user access page(ie, virtual address) discarded
> by kernel?
>
> The user can encounter SIGBUS.
>
> - What should user do for avoding SIGBUS?
> He should call vrange(addr, length, VRANGE_NOVOLATILE, mode) before
> accessing the range which was called
> vrange(addr, length, VRANGE_VOLATILE, mode)
>
> - What happens if user access page(ie, virtual address) doesn't
> discarded by kernel?
>
> The user can see vaild data which was there before calling
> vrange(., VRANGE_VOLATILE) without page fault.
>
> - What's different with madvise(DONTNEED)?
>
> System call semantic
>
> DONTNEED makes sure user always can see zero-fill pages after
> he calls madvise while vrange can see data or encounter SIGBUS.

For replacing DONTNEED, users would want zero-filled pages like DONTNEED
instead of SIGBUS. So a new flag option would be nice.

I played with this patch a bit. The result looks really promising
(i.e. 20x faster).

My machine is a KVM guest with 24 cpus and 8GB of RAM. I guess the
current DONTNEED implementation doesn't fit KVM at all.


# of      # of      # of iter
threads   iter      (patched glibc)
----------------------------------
 1         438       10740
 2         842       20916
 4         987       32534
 8         717       15155
12         714       14109
16         708       13457
20         720       13742
24         727       13642
28         715       13328
32         709       13096
36         705       13661
40         708       13634
44         707       13367
48         714       13377


---------libc patch (just dirty hack) ----------------------

diff --git a/malloc/arena.c b/malloc/arena.c
index 12a48ad..da04f67 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -365,6 +365,8 @@ extern struct dl_open_hook *_dl_open_hook;
libc_hidden_proto (_dl_open_hook);
#endif

+int vrange_enabled = 0;
+
static void
ptmalloc_init (void)
{
@@ -457,6 +459,18 @@ ptmalloc_init (void)
if (check_action != 0)
__malloc_check_init();
}
+
+ {
+ char *vrange = getenv("MALLOC_VRANGE");
+ if (vrange) {
+ int val = atoi(vrange);
+ if (val) {
+ printf("glibc: vrange enabled\n");
+ vrange_enabled = !!val;
+ }
+ }
+ }
+
void (*hook) (void) = force_reg (__malloc_initialize_hook);
if (hook != NULL)
(*hook)();
@@ -628,9 +642,14 @@ shrink_heap(heap_info *h, long diff)
return -2;
h->mprotect_size = new_size;
}
- else
- __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+ else {
+ if (vrange_enabled) {
+ syscall(314, (char *)h + new_size, diff, 0, 1);
+ } else {
+ __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
+ }
/*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
+ }

h->size = new_size;
return 0;
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 70b9329..3782244 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -4403,6 +4403,7 @@ _int_pvalloc(mstate av, size_t bytes)
/*
------------------------------ malloc_trim ------------------------------
*/
+extern int vrange_enabled;

static int mtrim(mstate av, size_t pad)
{
@@ -4443,7 +4444,12 @@ static int mtrim(mstate av, size_t pad)
content. */
memset (paligned_mem, 0x89, size & ~psm1);
#endif
- __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
+
+ if (vrange_enabled) {
+ syscall(314, paligned_mem, size & ~psm1, 0, 1);
+ } else {
+ __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
+ }

result = 1;
}





2013-04-11 06:55:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Wed, Apr 10, 2013 at 04:22:58PM -0400, KOSAKI Motohiro wrote:
> (3/12/13 3:38 AM), Minchan Kim wrote:
> > First of all, let's define the term.
> > From now on, I'd like to call it as vrange(a.k.a volatile range)
> > for anonymous page. If you have a better name in mind, please suggest.
> >
> > This version is still *RFC* because it's just quick prototype so
> > it doesn't support THP/HugeTLB/KSM and even couldn't build on !x86.
> > Before further sorting out issues, I'd like to post current direction
> > and discuss it. Of course, I'd like to extend this discussion in
> > comming LSF/MM.
> >
> > In this version, I changed lots of thing, expecially removed vma-based
> > approach because it needs write-side lock for mmap_sem, which will drop
> > performance in mutli-threaded big SMP system, KOSAKI pointed out.
> > And vma-based approach is hard to meet requirement of new system call by
> > John Stultz's suggested semantic for consistent purged handling.
> > (http://linux-kernel.2935.n7.nabble.com/RFC-v5-0-8-Support-volatile-for-anonymous-range-tt575773.html#none)
> >
> > I tested this patchset with modified jemalloc allocator which was
> > leaded by Jason Evans(jemalloc author) who was interest in this feature
> > and was happy to port his allocator to use new system call.
> > Super Thanks Jason!
> >
> > The benchmark for test is ebizzy. It have been used for testing the
> > allocator performance so it's good for me. Again, thanks for recommending
> > the benchmark, Jason.
> > (http://people.freebsd.org/~kris/scaling/ebizzy.html)
> >
> > The result is good on my machine (12 CPU, 1.2GHz, DRAM 2G)
> >
> > ebizzy -S 20
> >
> > jemalloc-vanilla: 52389 records/sec
> > jemalloc-vrange: 203414 records/sec
> >
> > ebizzy -S 20 with background memory pressure
> >
> > jemalloc-vanilla: 40746 records/sec
> > jemalloc-vrange: 174910 records/sec
> >
> > And it's much improved on KVM virtual machine.
> >
> > This patchset is based on v3.9-rc2
> >
> > - What's the sys_vrange(addr, length, mode, behavior)?
> >
> > It's a hint that user deliver to kernel so kernel can *discard*
> > pages in a range anytime. mode is one of VRANGE_VOLATILE and
> > VRANGE_NOVOLATILE. VRANGE_NOVOLATILE is memory pin operation so
> > kernel coudn't discard any pages any more while VRANGE_VOLATILE
> > is memory unpin opeartion so kernel can discard pages in vrange
> > anytime. At a moment, behavior is one of VRANGE_FULL and VRANGE
> > PARTIAL. VRANGE_FULL tell kernel that once kernel decide to
> > discard page in a vrange, please, discard all of pages in a
> > vrange selected by victim vrange. VRANGE_PARTIAL tell kernel
> > that please discard of some pages in a vrange. But now I didn't
> > implemented VRANGE_PARTIAL handling yet.
> >
> > - What happens if user access page(ie, virtual address) discarded
> > by kernel?
> >
> > The user can encounter SIGBUS.
> >
> > - What should user do for avoding SIGBUS?
> > He should call vrange(addr, length, VRANGE_NOVOLATILE, mode) before
> > accessing the range which was called
> > vrange(addr, length, VRANGE_VOLATILE, mode)
> >
> > - What happens if user access page(ie, virtual address) doesn't
> > discarded by kernel?
> >
> > The user can see vaild data which was there before calling
> > vrange(., VRANGE_VOLATILE) without page fault.
> >
> > - What's different with madvise(DONTNEED)?
> >
> > System call semantic
> >
> > DONTNEED makes sure user always can see zero-fill pages after
> > he calls madvise while vrange can see data or encounter SIGBUS.
>
> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
> instead of SIGBUS. So, new flag option would be nice.

If userspace people want it, I can do it. But I'm not sure they want it
at the moment, because vrange is a rather different concept from
madvise(DONTNEED) in terms of usage.

As you know well, in the case of DONTNEED the user calls madvise _once_
and the VM releases the memory as soon as the system call returns.
But vrange amounts to a delayed free that happens under system memory
pressure, so the user can't know when the OS will free the pages.
It means the user should call the pair of system calls, both
VRANGE_VOLATILE and VRANGE_NOVOLATILE, to use a volatile range
correctly (for simplicity, I'm leaving the SIGBUS fault-recovery method
out of this). If he makes a mistake (ie, fails to call
VRANGE_NOVOLATILE) on a range the current process is using, pages used
by the process could disappear suddenly.

In summary, I don't think vrange is a replacement for madvise(DONTNEED),
but it could be a useful companion to it. For example, we could make
vrange(VRANGE_VOLATILE) return 1 if memory pressure is already severe,
so the user can detect memory pressure from the return value and call
madvise(DONTNEED) in that case. Of course, we could handle that in the
vrange system call itself (ex, turn the vrange call into a
madvise(DONTNEED)), but I don't want that, because I want to keep the
vrange hinting system call very light at all times so users can count
on its latency.

>
> I played a bit this patch. The result looks really promissing.
> (i.e. 20x faster)

Thanks for testing with glibc!
Yes, although I didn't post my KVM test results with jemalloc, they
looked very good, too.

>
> My machine have 24cpus, 8GB ram, kvm guest. I guess current DONTNEED
> implementation doesn't fit kvm at all.

Yes. I expect MMU/cache/TLB handling on a virtual machine to be more
expensive than on a bare-metal box.

>
>
> # of # of # of
> thread iter iter (patched glibc)

What's the workload?

> ----------------------------------
> 1 438 10740
> 2 842 20916
> 4 987 32534
> 8 717 15155
> 12 714 14109
> 16 708 13457
> 20 720 13742
> 24 727 13642
> 28 715 13328
> 32 709 13096
> 36 705 13661
> 40 708 13634
> 44 707 13367
> 48 714 13377
>
>
> ---------libc patch (just dirty hack) ----------------------
>
> diff --git a/malloc/arena.c b/malloc/arena.c
> index 12a48ad..da04f67 100644
> --- a/malloc/arena.c
> +++ b/malloc/arena.c
> @@ -365,6 +365,8 @@ extern struct dl_open_hook *_dl_open_hook;
> libc_hidden_proto (_dl_open_hook);
> #endif
>
> +int vrange_enabled = 0;
> +
> static void
> ptmalloc_init (void)
> {
> @@ -457,6 +459,18 @@ ptmalloc_init (void)
> if (check_action != 0)
> __malloc_check_init();
> }
> +
> + {
> + char *vrange = getenv("MALLOC_VRANGE");
> + if (vrange) {
> + int val = atoi(vrange);
> + if (val) {
> + printf("glibc: vrange enabled\n");
> + vrange_enabled = !!val;
> + }
> + }
> + }
> +
> void (*hook) (void) = force_reg (__malloc_initialize_hook);
> if (hook != NULL)
> (*hook)();
> @@ -628,9 +642,14 @@ shrink_heap(heap_info *h, long diff)
> return -2;
> h->mprotect_size = new_size;
> }
> - else
> - __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
> + else {
> + if (vrange_enabled) {
> + syscall(314, (char *)h + new_size, diff, 0, 1);
> + } else {
> + __madvise ((char *)h + new_size, diff, MADV_DONTNEED);
> + }
> /*fprintf(stderr, "shrink %p %08lx\n", h, new_size);*/
> + }
>
> h->size = new_size;
> return 0;
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 70b9329..3782244 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -4403,6 +4403,7 @@ _int_pvalloc(mstate av, size_t bytes)
> /*
> ------------------------------ malloc_trim ------------------------------
> */
> +extern int vrange_enabled;
>
> static int mtrim(mstate av, size_t pad)
> {
> @@ -4443,7 +4444,12 @@ static int mtrim(mstate av, size_t pad)
> content. */
> memset (paligned_mem, 0x89, size & ~psm1);
> #endif
> - __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
> +
> + if (vrange_enabled) {
> + syscall(314, paligned_mem, size & ~psm1, 0, 1);
> + } else {
> + __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED);
> + }
>
> result = 1;
> }

I can't find a VRANGE_NOVOLATILE call in your patch, so I think you just
tested with enough memory. I expect you wanted a fast prototype to see
the performance gain. Yes, it looks good, although the code above isn't
complete since it doesn't handle purged pages.

Our next step would be to optimize the reclaim path so that when memory
pressure is severe, vrange gets a better result than vanilla by
avoiding swap-out. Any suggestions are welcome.

--
Kind regards,
Minchan Kim

2013-04-11 07:20:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

>>> DONTNEED makes sure user always can see zero-fill pages after
>>> he calls madvise while vrange can see data or encounter SIGBUS.
>>
>> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
>> instead of SIGBUS. So, new flag option would be nice.
>
> If userspace people want it, I can do it.
> But not sure they want it at the moment becaue vrange is rather
> different concept of madvise(DONTNEED) POV usage.
>
> As you know well, in case of DONTNEED, user calls madvise _once_ and
> VM releases memory as soon as he called system call.
> But vrange is same with delayed free when the system memory pressure
> happens so user can't know OS frees the pages anytime.
> It means user should call pair of system call both VRANGE_VOLATILE
> and VRANGE_NOVOLATILE for right usage of volatile range
> (for simple, I don't want to tell SIGBUS fault recovery method).
> If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
> which is used by current process, pages used by some process could be
> disappeared suddenly.
>
> In summary, I don't think vrange is a replacement of madvise(DONTNEED)
> but could be useful with madvise(DONTNEED) friend. For example, we can
> make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already

Do you mean vrange(VRANGE_NOVOLATILE)?
Btw, assigning a new error number in asm-generic/errno.h would be better
than a strange '1'.


> severe so user can catch up memory pressure by return value and calls
> madvise(DONTNEED) if memory pressure was already severe. Of course, we
> can handle it vrange system call itself(ex, change vrange system call to
> madvise(DONTNEED) but don't want it because I want to keep vrange hinting
> sytem call very light at all times so user can expect latency.

For allocator usage, vrange(NOVOLATILE) is annoying and not needed at all.
When the data has already been purged, just return a new zero-filled page.
So maybe adding a new flag is worthwhile, because malloc is definitely a
fast path and a new syscall invocation is unwelcome.


>> # of # of # of
>> thread iter iter (patched glibc)
>
> What's the workload?

Ahh, sorry, I forgot to write that. I used ebizzy, your favorite workload.


2013-04-11 08:02:49

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
> >>> DONTNEED makes sure user always can see zero-fill pages after
> >>> he calls madvise while vrange can see data or encounter SIGBUS.
> >>
> >> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
> >> instead of SIGBUS. So, new flag option would be nice.
> >
> > If userspace people want it, I can do it.
> > But not sure they want it at the moment becaue vrange is rather
> > different concept of madvise(DONTNEED) POV usage.
> >
> > As you know well, in case of DONTNEED, user calls madvise _once_ and
> > VM releases memory as soon as he called system call.
> > But vrange is same with delayed free when the system memory pressure
> > happens so user can't know OS frees the pages anytime.
> > It means user should call pair of system call both VRANGE_VOLATILE
> > and VRANGE_NOVOLATILE for right usage of volatile range
> > (for simple, I don't want to tell SIGBUS fault recovery method).
> > If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
> > which is used by current process, pages used by some process could be
> > disappeared suddenly.
> >
> > In summary, I don't think vrange is a replacement of madvise(DONTNEED)
> > but could be useful with madvise(DONTNEED) friend. For example, we can
> > make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already
>
> Do you mean vrange(VRANGE_UNVOLATILE)?

I meant VRANGE_VOLATILE. It seems my explanation was poor; here it goes
again. Currently vrange just returns 0 if the system call is successful
and an error otherwise. But we could change it as follows:

1. return 0 if the system call is successful and memory pressure isn't severe
2. return 1 if the system call is successful and memory pressure is severe
3. return -ERRXXX if the system call fails for some reason

So the process can learn about system-wide memory pressure without
peeking at vmstat, and then call madvise(DONTNEED) right after the
vrange call. The benefit is that the system can zap all the pages
instantly.
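
From the caller's side the convention would look like this (a sketch
only; vrange() stands for the raw syscall wrapper, and the constants
are the prototype's placeholders):

long ret = vrange(buf, len, VRANGE_VOLATILE, VRANGE_FULL);

if (ret < 0)
	perror("vrange");	/* case 3: real failure */
else if (ret == 1)
	/* case 2: pressure already severe -- zap instantly rather
	 * than waiting for reclaim. */
	madvise(buf, len, MADV_DONTNEED);
/* case 1 (ret == 0): pages stay until reclaim wants them. */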

> btw, assign new error number to asm-generic/errno.h is better than strange '1'.

I can, and I admit "1" is rather weird.
But it's not an error, either.

>
>
> > severe so user can catch up memory pressure by return value and calls
> > madvise(DONTNEED) if memory pressure was already severe. Of course, we
> > can handle it vrange system call itself(ex, change vrange system call to
> > madvise(DONTNEED) but don't want it because I want to keep vrange hinting
> > sytem call very light at all times so user can expect latency.
>
> For allocator usage, vrange(UNVOLATILE) is annoying and don't need at all.
> When data has already been purged, just return new zero filled page. so,
> maybe adding new flag is worthwhile. Because malloc is definitely fast path

I really want it, and it's exactly the same as madvise(MADV_FREE).
But to implement it, we need to do something at page granularity over
the address range in the system call context, like zap_pte_range
(ex, clear page table bits and mark something in the page flags for the
reclaimer to detect). It means the vrange system call would still be
heavyweight even though we could remove the lazy page fault.

Do you have any idea how to remove that? If so, I'm very open to
implementing it.


> and adding new syscall invokation is unwelcome.

Sure. But one more system call could be cheaper than a page-granularity
operation over the purged range.

>
>
> >> # of # of # of
> >> thread iter iter (patched glibc)
> >
> > What's the workload?
>
> Ahh, sorry. I forgot to write. I use ebizzy, your favolite workload.

--
Kind regards,
Minchan Kim

2013-04-11 08:15:48

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

(4/11/13 4:02 AM), Minchan Kim wrote:
> On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
>>>>> DONTNEED makes sure user always can see zero-fill pages after
>>>>> he calls madvise while vrange can see data or encounter SIGBUS.
>>>>
>>>> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
>>>> instead of SIGBUS. So, new flag option would be nice.
>>>
>>> If userspace people want it, I can do it.
>>> But not sure they want it at the moment becaue vrange is rather
>>> different concept of madvise(DONTNEED) POV usage.
>>>
>>> As you know well, in case of DONTNEED, user calls madvise _once_ and
>>> VM releases memory as soon as he called system call.
>>> But vrange is same with delayed free when the system memory pressure
>>> happens so user can't know OS frees the pages anytime.
>>> It means user should call pair of system call both VRANGE_VOLATILE
>>> and VRANGE_NOVOLATILE for right usage of volatile range
>>> (for simple, I don't want to tell SIGBUS fault recovery method).
>>> If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
>>> which is used by current process, pages used by some process could be
>>> disappeared suddenly.
>>>
>>> In summary, I don't think vrange is a replacement of madvise(DONTNEED)
>>> but could be useful with madvise(DONTNEED) friend. For example, we can
>>> make return 1 in vrange(VRANGE_VOLATILE) if memory pressure was already
>>
>> Do you mean vrange(VRANGE_UNVOLATILE)?
>
> I meant VRANGE_VOLATILE. It seems my explanation was poor. Here it goes, again.
> Now vrange's semantic return just 0 if the system call is successful, otherwise,
> return error. But we can change it as folows
>
> 1. return 0 if the system call is successful and memory pressure isn't severe
> 2. return 1 if the system call is successful and memory pressure is severe
> 3. return -ERRXXX if the system call is failed by some reason
>
> So the process can know system-wide memory pressure without peeking the vmstat
> and then call madvise(DONTNEED) right after vrange call. The benefit is system
> can zap all pages instantly.

Do you mean your patchset is not the latest? And when would you use this
feature? What happens if VRANGE_VOLATILE returns 0 and the range is
purged just after the syscall returns?


>> btw, assign new error number to asm-generic/errno.h is better than strange '1'.
>
> I can and admit "1" is rather weired.
> But it's not error, either.

If this is really necessary, I don't oppose it. However, I am still not convinced.



>>> severe so user can catch up memory pressure by return value and calls
>>> madvise(DONTNEED) if memory pressure was already severe. Of course, we
>>> can handle it vrange system call itself(ex, change vrange system call to
>>> madvise(DONTNEED) but don't want it because I want to keep vrange hinting
>>> sytem call very light at all times so user can expect latency.
>>
>> For allocator usage, vrange(UNVOLATILE) is annoying and don't need at all.
>> When data has already been purged, just return new zero filled page. so,
>> maybe adding new flag is worthwhile. Because malloc is definitely fast path
>
> I really want it and it's exactly same with madvise(MADV_FREE).
> But for implementation, we need page granularity someting in address range
> in system call context like zap_pte_range(ex, clear page table bits and
> mark something to page flags for reclaimer to detect it).
> It means vrange system call is still bigger although we are able to remove
> lazy page fault.
>
> Do you have any idea to remove it? If so, I'm very open to implement it.

Hm. Maybe I am missing something. I'll look at the code closely after LSF/MM.


>> and adding new syscall invokation is unwelcome.
>
> Sure. But one more system call could be cheaper than page-granuarity
> operation on purged range.

I don't think the cost of vrange(VOLATILE) is what this discussion is
about. Whether we send SIGBUS or just nuke the pte, the purge should
happen in vmscan, not in the vrange() syscall.








2013-04-11 08:31:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On Thu, Apr 11, 2013 at 04:15:40AM -0400, KOSAKI Motohiro wrote:
> (4/11/13 4:02 AM), Minchan Kim wrote:
> > On Thu, Apr 11, 2013 at 03:20:30AM -0400, KOSAKI Motohiro wrote:
> >>>>> DONTNEED makes sure user always can see zero-fill pages after
> >>>>> he calls madvise while vrange can see data or encounter SIGBUS.
> >>>>
> >>>> For replacing DONTNEED, user want to zero-fill pages like DONTNEED
> >>>> instead of SIGBUS. So, new flag option would be nice.
> >>>
> >>> If userspace people want it, I can do it.
> >>> But not sure they want it at the moment becaue vrange is rather
> >>> different concept of madvise(DONTNEED) POV usage.
> >>>
> >>> As you know well, in case of DONTNEED, user calls madvise _once_ and
> >>> VM releases memory as soon as he called system call.
> >>> But vrange is same with delayed free when the system memory pressure
> >>> happens so user can't know OS frees the pages anytime.
> >>> It means user should call pair of system call both VRANGE_VOLATILE
> >>> and VRANGE_NOVOLATILE for right usage of volatile range
> >>> (for simple, I don't want to tell SIGBUS fault recovery method).
> >>> If he took a mistake(ie, NOT to call VRANGE_NOVOLATILE) on the range
> >>> which is used by current process, pages used by some process could be
> >>> disappeared suddenly.
> >>>
> >>> In summary, I don't think vrange is a replacement for madvise(DONTNEED),
> >>> but it could be a useful companion to madvise(DONTNEED). For example, we could
> >>> make vrange(VRANGE_VOLATILE) return 1 if memory pressure was already
> >>
> >> Do you mean vrange(VRANGE_UNVOLATILE)?
> >
> > I meant VRANGE_VOLATILE. It seems my explanation was poor. Here it goes again.
> > Currently vrange just returns 0 if the system call is successful; otherwise it
> > returns an error. But we could change it as follows:
> >
> > 1. return 0 if the system call is successful and memory pressure isn't severe
> > 2. return 1 if the system call is successful and memory pressure is severe
> > 3. return -ERRXXX if the system call fails for some reason
> >
> > So the process can know the system-wide memory pressure without peeking at vmstat
> > and then call madvise(DONTNEED) right after the vrange call. The benefit is that
> > the system can zap all pages instantly.
>
> Do you mean your patchset is not the latest? And when would you use this feature? What

Yes. I meant I could do it in the next spin, to hear opinions.

> happens if VRANGE_VOLATILE returns 0 and the range is purged just after the syscall returns?

It could be an idea, but I will think it over.
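
As an illustration of the proposed return-value protocol, here is a minimal
userspace sketch. The vrange() wrapper, the syscall number, and the flag
values below are assumptions made for the sketch, not definitions from this
patchset:

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical numbers; the RFC does not fix any of these. */
    #define __NR_vrange       313
    #define VRANGE_VOLATILE   0
    #define VRANGE_NOVOLATILE 1
    #define VRANGE_FULL       0

    static long vrange(void *addr, size_t len, int mode, int behavior)
    {
            return syscall(__NR_vrange, addr, len, mode, behavior);
    }

    /* Proposed semantics: 0 = marked volatile, 1 = marked volatile but
     * memory pressure is already severe, <0 = error. */
    static void release_chunk(void *addr, size_t len)
    {
            if (vrange(addr, len, VRANGE_VOLATILE, VRANGE_FULL) == 1)
                    madvise(addr, len, MADV_DONTNEED); /* zap right now */
    }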

>
>
> >> btw, assigning a new error number in asm-generic/errno.h would be better than a strange '1'.
> >
> > I can, and I admit "1" is rather weird.
> > But it's not an error, either.
>
> If this is really necessary, I don't oppose it. However, I am still not convinced.
>
>
>
> >>> severe, so the user can catch memory pressure from the return value and call
> >>> madvise(DONTNEED) if memory pressure was already severe. Of course, we
> >>> could handle it in the vrange system call itself (ex, turn the vrange system
> >>> call into madvise(DONTNEED)), but I don't want that because I want to keep the
> >>> vrange hinting system call very light at all times so the user can predict its latency.
> >>
> >> For allocator usage, vrange(UNVOLATILE) is annoying and not needed at all.
> >> When data has already been purged, just return a new zero-filled page. So
> >> maybe adding a new flag is worthwhile, because malloc is definitely a fast path
> >
> > I really want it, and it's exactly the same as madvise(MADV_FREE).
> > But for the implementation, we need some page-granularity work on the address
> > range in system call context, like zap_pte_range (ex, clear page table bits and
> > mark something in the page flags for the reclaimer to detect it).
> > It means the vrange system call is still expensive, although we are able to
> > remove the lazy page fault.
> >
> > Do you have any idea how to remove it? If so, I'm very open to implementing it.
>
> Hm. Maybe I am missing something. I'll look at the code closely after LSF.

Please see Rik's old work on MADV_FREE.

>
>
> >> and adding a new syscall invocation is unwelcome.
> >
> > Sure. But one more system call could be cheaper than a page-granularity
> > operation on the purged range.
>
> I don't think the vrange(VOLATILE) cost is relevant to this discussion.
> Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan,
> not in the vrange() syscall.

Again, please see MADV_FREE: http://lwn.net/Articles/230799/
It changes the pte and page flags on all pages of the range through
zap_pte_range. So it would make vrange(VOLATILE) expensive, and
the bigger the range is, the bigger the cost is.


--
Kind regards,
Minchan Kim

2013-04-11 15:01:15

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

>>>> and adding a new syscall invocation is unwelcome.
>>>
>>> Sure. But one more system call could be cheaper than a page-granularity
>>> operation on the purged range.
>>
>> I don't think the vrange(VOLATILE) cost is relevant to this discussion.
>> Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan,
>> not in the vrange() syscall.
>
> Again, please see MADV_FREE: http://lwn.net/Articles/230799/
> It changes the pte and page flags on all pages of the range through
> zap_pte_range. So it would make vrange(VOLATILE) expensive, and
> the bigger the range is, the bigger the cost is.

This hadn't crossed my mind. Currently try_to_discard_one() inserts a vrange
marker for raising SIGBUS; then we could insert pte_none() at the same cost, too. Am
I missing something?

I can't imagine why the pte should be zapped in vrange(VOLATILE).

2013-04-14 07:42:14

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

Hi KOSAKI,

On Thu, Apr 11, 2013 at 11:01:11AM -0400, KOSAKI Motohiro wrote:
> >>>> and adding a new syscall invocation is unwelcome.
> >>>
> >>> Sure. But one more system call could be cheaper than a page-granularity
> >>> operation on the purged range.
> >>
> >> I don't think the vrange(VOLATILE) cost is relevant to this discussion.
> >> Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan,
> >> not in the vrange() syscall.
> >
> > Again, please see MADV_FREE: http://lwn.net/Articles/230799/
> > It changes the pte and page flags on all pages of the range through
> > zap_pte_range. So it would make vrange(VOLATILE) expensive, and
> > the bigger the range is, the bigger the cost is.
>
> This hadn't crossed my mind. Currently try_to_discard_one() inserts a vrange
> marker for raising SIGBUS; then we could insert pte_none() at the same cost, too. Am
> I missing something?

For your requirement, we need some tracking model to detect that a page is
currently being used by the process before the VM discards it, *if* we don't
provide the vrange(NOVOLATILE) pair system call (see below). So the tracking
model would have to be set up in the vrange(VOLATILE) system call context.

>
> I can't imagine why the pte should be zapped in vrange(VOLATILE).

Sorry, my explanation was poor.
I will try again.

First of all, the thing you want is almost like MADV_FREE,
so let's look at that first.

If you call madvise(range, MADV_FREE), the VM has to investigate all of the
pages mapped in the page table for the range (start, start + len), so we need
a page table lookup over the range, and we mark a flag in every page descriptor
(ex, PG_lazyfree) as a hint to the kernel to discard the page instead of
swapping it out when reclaim happens. Another thing we need is to clear out
the dirty bit from the PTE, to detect whether the pages have been dirtied
since we called madvise(range, MADV_FREE), because we can't discard pages
which are still in use by some process after he called madvise. So if the VM
finds that a page has PG_lazyfree but the page was dirtied recently, by
peeking at the PTE, the VM can't discard the page.
So the system call's overhead for madvise(MADV_FREE) is the following:

1. look up all pages in the page table for the range
2. mark some bit (PG_lazyfree) in the page descriptors of the pages mapped in the range
3. clear the dirty bit and flush the TLB

So, madvise(MADV_FREE) would be better than madvise(DONTNEED) because it can
avoid the page fault if memory pressure doesn't happen, but the system call
overhead can still be huge, and notably the overhead increases proportionally
with the range size (see the sketch just below).
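
To make the linear cost concrete, here is a schematic sketch of the per-PTE
work implied by steps 1-3. This is not code from any patchset: the walk down
to the PTE, the locking, and huge page handling are elided, and
SetPageLazyfree()/PG_lazyfree is only the hypothetical flag named above.

    #include <linux/mm.h>
    #include <asm/pgtable.h>
    #include <asm/tlbflush.h>

    /* Called once per pte in [start, end), so the syscall cost is
     * O(number of mapped pages) in the range. */
    static void lazyfree_one_pte(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep)
    {
            pte_t pte = *ptep;
            struct page *page;

            if (!pte_present(pte))
                    return;

            page = vm_normal_page(vma, addr, pte);
            if (!page)
                    return;

            /* step 2: hint reclaim to discard instead of swapping out */
            SetPageLazyfree(page);          /* hypothetical page flag */

            /* step 3: clear dirty (and young) so a later write shows up */
            pte = pte_mkclean(pte_mkold(pte));
            set_pte_at(vma->vm_mm, addr, ptep, pte);
    }
    /* ...followed by a single flush_tlb_range(vma, start, end). */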

Let's talk about vrange(range, VOLATILE).
Its overhead is very small: it just marks a flag in a structure which
represents the range (ie, struct vrange). When the VM wants to reclaim some
pages and finds that a page is mapped in a VOLATILE area, it can discard it
instead of swapping it out. This moves the overhead from the system call
itself to the VM reclaim path, which is a very slow path in the system, and I
think that's a desirable design (and that's why we have rmap).
But one problem remains. The VM can't detect that a page is in use by the
process after he calls vrange(range, VOLATILE), because we didn't do anything
in vrange(VOLATILE), so the VM might discard the page under the process. That
didn't happen with madvise(MADV_FREE), because it cleared the dirty bit of
the PTE to detect whether the page had been used since madvise was called.

The solution in vrange is the new vrange(range, NOVOLATILE) system call,
which gives the kernel the hint to stop discarding pages in the range.
The cost of vrange(range, NOVOLATILE) is very small, too:
it just clears the flags in the struct vrange which represents the range.

So I think calling the pair of volatile system calls would be cheaper than a
single madvise(MADV_FREE) (see the sketch below).
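
By contrast, here is a rough sketch of the bookkeeping the vrange pair has to
do. The field and function names are illustrative only; a real implementation
also needs range lookup and locking, which are elided:

    #include <stdbool.h>

    struct vrange {
            unsigned long start, end;
            bool marked;    /* range is currently volatile */
            bool purged;    /* set by reclaim when it discards pages */
    };

    /* vrange(range, VOLATILE): O(1), touches no page table entry. */
    static void vrange_mark_volatile(struct vrange *v)
    {
            v->marked = true;
            v->purged = false;      /* reclaim may set this later */
    }

    /* vrange(range, NOVOLATILE): also O(1); reports whether reclaim
     * discarded anything while the range was volatile. */
    static bool vrange_mark_novolatile(struct vrange *v)
    {
            v->marked = false;
            return v->purged;
    }

Neither call walks a single pte, which is why the cost of the pair stays flat
no matter how big the range is.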

I hope this helps your understanding, but I'm not sure it does, because I am
writing this in an airport, where it is very hard to focus on my work. :(


--
Kind regards,
Minchan Kim

2013-04-16 03:33:15

by John Stultz

[permalink] [raw]
Subject: Re: [RFC v7 00/11] Support vrange for anonymous page

On 04/14/2013 12:42 AM, Minchan Kim wrote:
> Hi KOSAKI,
>
> On Thu, Apr 11, 2013 at 11:01:11AM -0400, KOSAKI Motohiro wrote:
>>>>>> and adding a new syscall invocation is unwelcome.
>>>>> Sure. But one more system call could be cheaper than a page-granularity
>>>>> operation on the purged range.
>>>> I don't think the vrange(VOLATILE) cost is relevant to this discussion.
>>>> Whether we send SIGBUS or just nuke the pte, the purge should be done in vmscan,
>>>> not in the vrange() syscall.
>>> Again, please see MADV_FREE: http://lwn.net/Articles/230799/
>>> It changes the pte and page flags on all pages of the range through
>>> zap_pte_range. So it would make vrange(VOLATILE) expensive, and
>>> the bigger the range is, the bigger the cost is.
>> This hadn't crossed my mind. Currently try_to_discard_one() inserts a vrange
>> marker for raising SIGBUS; then we could insert pte_none() at the same cost, too. Am
>> I missing something?
> For your requirement, we need some tracking model to detect that a page is
> currently being used by the process before the VM discards it, *if* we don't
> provide the vrange(NOVOLATILE) pair system call (see below). So the tracking
> model would have to be set up in the vrange(VOLATILE) system call context.

To further clarify Minchan's note here, the reason it's important for the
application to use vrange(NOVOLATILE) is really to help define _when
the range stops being volatile_.

In your libc hack to use vrange(), you see the benefit of not immediately
purging the memory as you do with MADV_DONTNEED. However, if the heap
grows again, and those addresses are re-used, nothing has stopped those
pages from continuing to be volatile. Thus the kernel could then decide
to purge those pages after they start to be used again, and you'd lose
data. I suspect that's not what you want. :)

Rik's MADV_FREE implementation is very similar to vrange(VOLATILE), but
has an implicit vrange(NOVOLATILE) on any page write. So by dirtying a
page, it stops the kernel from later purging it.

This MADV_FREE semantic works very well if you always want zero-fill (as
in the case of malloc/free). But for other data, it's important to know
something was lost (as a zero page could be valid data), and that's why
we provide the SIGBUS, as well as the purged notification on
vrange(NOVOLATILE).

In other words, as long as you do a vrange(NOVOLATILE) when you grow the
heap again (before it's used), it should be very similar to the MADV_FREE
behavior, but is more flexible for other use cases (a hypothetical sketch
of this reuse pattern follows below).
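
As a hypothetical sketch of that reuse pattern, building on the vrange()
wrapper sketched earlier in the thread, and assuming for illustration that
vrange(NOVOLATILE) returns 1 when the range was purged (the exact reporting
mechanism is an assumption, not something this RFC nails down):

    #include <string.h>

    /* Call this BEFORE touching a chunk whose pages were marked volatile. */
    static void *reuse_chunk(void *chunk, size_t size)
    {
            long ret = vrange(chunk, size, VRANGE_NOVOLATILE, VRANGE_FULL);

            if (ret < 0)
                    return NULL;            /* pinning the range failed */
            if (ret == 1)                   /* contents were purged */
                    memset(chunk, 0, size); /* fine for zero-fill data;
                                             * real data would need real
                                             * recovery here instead */
            return chunk;                   /* safe now: no SIGBUS, no
                                             * later purge under us */
    }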

thanks
-john