Here is a series of nine patches against 2.6.32-rc7-mm1, at last making
KSM's shared pages swappable. The main patches, 2, 3 and 4, have been
around for over a month; but I underestimated the tail of the job,
working out the right compromises to deal with the consequences of
having ksm pages on the LRUs.
Documentation/vm/ksm.txt | 22 -
include/linux/ksm.h | 71 ++++
include/linux/migrate.h | 8
include/linux/rmap.h | 35 ++
mm/Kconfig | 2
mm/internal.h | 3
mm/ksm.c | 567 ++++++++++++++++++++++++++++---------
mm/memcontrol.c | 7
mm/memory.c | 6
mm/memory_hotplug.c | 2
mm/mempolicy.c | 19 -
mm/migrate.c | 112 ++-----
mm/mlock.c | 4
mm/rmap.c | 151 +++++++--
mm/swapfile.c | 11
15 files changed, 741 insertions(+), 279 deletions(-)
Thanks!
Hugh
When KSM merges an mlocked page, it has been forgetting to munlock it:
that's been left to free_page_mlock(), which reports it in /proc/vmstat
as unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
whinges "Page flag mlocked set for process" in mmotm, whereas mainline
is silently forgiving). Call munlock_vma_page() to fix that.
Signed-off-by: Hugh Dickins <[email protected]>
---
Is this a fix that I ought to backport to 2.6.32? It does rely on part of
an earlier patch (moved unlock_page down), so does not apply cleanly as is.
mm/internal.h | 3 ++-
mm/ksm.c | 4 ++++
mm/mlock.c | 4 ++--
3 files changed, 8 insertions(+), 3 deletions(-)
--- ksm0/mm/internal.h 2009-11-14 10:17:02.000000000 +0000
+++ ksm1/mm/internal.h 2009-11-22 20:39:56.000000000 +0000
@@ -105,9 +105,10 @@ static inline int is_mlocked_vma(struct
}
/*
- * must be called with vma's mmap_sem held for read, and page locked.
+ * must be called with vma's mmap_sem held for read or write, and page locked.
*/
extern void mlock_vma_page(struct page *page);
+extern void munlock_vma_page(struct page *page);
/*
* Clear the page's PageMlocked(). This can be useful in a situation where
--- ksm0/mm/ksm.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm1/mm/ksm.c 2009-11-22 20:39:56.000000000 +0000
@@ -34,6 +34,7 @@
#include <linux/ksm.h>
#include <asm/tlbflush.h>
+#include "internal.h"
/*
* A few notes about the KSM scanning process,
@@ -762,6 +763,9 @@ static int try_to_merge_one_page(struct
pages_identical(page, kpage))
err = replace_page(vma, page, kpage, orig_pte);
+ if ((vma->vm_flags & VM_LOCKED) && !err)
+ munlock_vma_page(page);
+
unlock_page(page);
out:
return err;
--- ksm0/mm/mlock.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm1/mm/mlock.c 2009-11-22 20:39:56.000000000 +0000
@@ -99,14 +99,14 @@ void mlock_vma_page(struct page *page)
* not get another chance to clear PageMlocked. If we successfully
* isolate the page and try_to_munlock() detects other VM_LOCKED vmas
* mapping the page, it will restore the PageMlocked state, unless the page
- * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
+ * is mapped in a non-linear vma. So, we go ahead and ClearPageMlocked(),
* perhaps redundantly.
* If we lose the isolation race, and the page is mapped by other VM_LOCKED
* vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
* either of which will restore the PageMlocked state by calling
* mlock_vma_page() above, if it can grab the vma's mmap sem.
*/
-static void munlock_vma_page(struct page *page)
+void munlock_vma_page(struct page *page)
{
BUG_ON(!PageLocked(page));
Initial implementation for swapping out KSM's shared pages: add
page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls
when faced with a PageKsm page.
Most of what's needed can be got from the rmap_items listed from
the stable_node of the ksm page, without discovering the actual vma:
so in this patch just fake up a struct vma for page_referenced_one()
or try_to_unmap_one(), then refine that in the next patch.
Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always
been implicit there (being only set with VM_SHARED, already excluded),
but let's make it explicit, to help justify the lack of nonlinear unmap.
Rely on the page lock to protect against concurrent modifications to
that page's node of the stable tree.
The awkward part is not swapout but swapin: do_swap_page() and
page_add_anon_rmap() now have to allow for new possibilities - perhaps
a ksm page still in swapcache, perhaps a swapcache page associated with
one location in one anon_vma now needed for another location or anon_vma.
(And the vma might even be no longer VM_MERGEABLE when that happens.)
ksm_might_need_to_copy() checks for that case, and supplies a duplicate
page when necessary, simply leaving it to a subsequent pass of ksmd to
rediscover the identity and merge them back into one ksm page.
Disappointingly primitive: but the alternative would have to accumulate
unswappable info about the swapped out ksm pages, limiting swappability.
Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for
the particular case it was handling, so just use it instead.
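In sketch form (the mm/memory.c and include/linux/ksm.h hunks below are the
real thing), do_swap_page() is expected to use the new helper like this:
ksm_might_need_to_copy() hands back the original page when its anon_vma and
index still fit this vma and address, and otherwise substitutes a freshly
copied, locked page for this fault, leaving ksmd to re-merge later.

	/* in do_swap_page(), once the swapcache page is locked: */
	page = ksm_might_need_to_copy(page, vma, address);
	if (!page) {
		/* a copy was needed, but allocating it failed */
		ret = VM_FAULT_OOM;
		goto out;
	}
	/*
	 * From here on, "page" is either the original swapcache page
	 * (still locked), or a private duplicate which is already on
	 * the LRU and locked by ksm_does_need_to_copy(); the rest of
	 * do_swap_page() carries on unchanged, and page_add_anon_rmap()
	 * now copes with being handed a PageKsm page.
	 */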
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/ksm.h | 54 +++++++++++-
include/linux/rmap.h | 5 +
mm/ksm.c | 172 +++++++++++++++++++++++++++++++++++++----
mm/memory.c | 6 +
mm/rmap.c | 65 +++++++++------
mm/swapfile.c | 11 ++
6 files changed, 264 insertions(+), 49 deletions(-)
--- ksm1/include/linux/ksm.h 2009-11-14 10:17:02.000000000 +0000
+++ ksm2/include/linux/ksm.h 2009-11-22 20:40:04.000000000 +0000
@@ -9,10 +9,12 @@
#include <linux/bitops.h>
#include <linux/mm.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
#include <linux/sched.h>
-#include <linux/vmstat.h>
struct stable_node;
+struct mem_cgroup;
#ifdef CONFIG_KSM
int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
@@ -57,11 +59,36 @@ static inline void set_page_stable_node(
(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
}
-static inline void page_add_ksm_rmap(struct page *page)
+/*
+ * When do_swap_page() first faults in from swap what used to be a KSM page,
+ * no problem, it will be assigned to this vma's anon_vma; but thereafter,
+ * it might be faulted into a different anon_vma (or perhaps to a different
+ * offset in the same anon_vma). do_swap_page() cannot do all the locking
+ * needed to reconstitute a cross-anon_vma KSM page: for now it has to make
+ * a copy, and leave remerging the pages to a later pass of ksmd.
+ *
+ * We'd like to make this conditional on vma->vm_flags & VM_MERGEABLE,
+ * but what if the vma was unmerged while the page was swapped out?
+ */
+struct page *ksm_does_need_to_copy(struct page *page,
+ struct vm_area_struct *vma, unsigned long address);
+static inline struct page *ksm_might_need_to_copy(struct page *page,
+ struct vm_area_struct *vma, unsigned long address)
{
- if (atomic_inc_and_test(&page->_mapcount))
- __inc_zone_page_state(page, NR_ANON_PAGES);
+ struct anon_vma *anon_vma = page_anon_vma(page);
+
+ if (!anon_vma ||
+ (anon_vma == vma->anon_vma &&
+ page->index == linear_page_index(vma, address)))
+ return page;
+
+ return ksm_does_need_to_copy(page, vma, address);
}
+
+int page_referenced_ksm(struct page *page,
+ struct mem_cgroup *memcg, unsigned long *vm_flags);
+int try_to_unmap_ksm(struct page *page, enum ttu_flags flags);
+
#else /* !CONFIG_KSM */
static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
@@ -84,7 +111,22 @@ static inline int PageKsm(struct page *p
return 0;
}
-/* No stub required for page_add_ksm_rmap(page) */
+static inline struct page *ksm_might_need_to_copy(struct page *page,
+ struct vm_area_struct *vma, unsigned long address)
+{
+ return page;
+}
+
+static inline int page_referenced_ksm(struct page *page,
+ struct mem_cgroup *memcg, unsigned long *vm_flags)
+{
+ return 0;
+}
+
+static inline int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
+{
+ return 0;
+}
#endif /* !CONFIG_KSM */
-#endif
+#endif /* __LINUX_KSM_H */
--- ksm1/include/linux/rmap.h 2009-11-14 10:17:02.000000000 +0000
+++ ksm2/include/linux/rmap.h 2009-11-22 20:40:04.000000000 +0000
@@ -89,6 +89,9 @@ static inline void page_dup_rmap(struct
*/
int page_referenced(struct page *, int is_locked,
struct mem_cgroup *cnt, unsigned long *vm_flags);
+int page_referenced_one(struct page *, struct vm_area_struct *,
+ unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
+
enum ttu_flags {
TTU_UNMAP = 0, /* unmap mode */
TTU_MIGRATION = 1, /* migration mode */
@@ -102,6 +105,8 @@ enum ttu_flags {
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
int try_to_unmap(struct page *, enum ttu_flags flags);
+int try_to_unmap_one(struct page *, struct vm_area_struct *,
+ unsigned long address, enum ttu_flags flags);
/*
* Called from mm/filemap_xip.c to unmap empty zero page
--- ksm1/mm/ksm.c 2009-11-22 20:39:56.000000000 +0000
+++ ksm2/mm/ksm.c 2009-11-22 20:40:04.000000000 +0000
@@ -196,6 +196,13 @@ static DECLARE_WAIT_QUEUE_HEAD(ksm_threa
static DEFINE_MUTEX(ksm_thread_mutex);
static DEFINE_SPINLOCK(ksm_mmlist_lock);
+/*
+ * Temporary hack for page_referenced_ksm() and try_to_unmap_ksm(),
+ * later we rework things a little to get the right vma to them.
+ */
+static DEFINE_SPINLOCK(ksm_fallback_vma_lock);
+static struct vm_area_struct ksm_fallback_vma;
+
#define KSM_KMEM_CACHE(__struct, __flags) kmem_cache_create("ksm_"#__struct,\
sizeof(struct __struct), __alignof__(struct __struct),\
(__flags), NULL)
@@ -445,14 +452,20 @@ static void remove_rmap_item_from_tree(s
{
if (rmap_item->address & STABLE_FLAG) {
struct stable_node *stable_node;
+ struct page *page;
stable_node = rmap_item->head;
+ page = stable_node->page;
+ lock_page(page);
+
hlist_del(&rmap_item->hlist);
- if (stable_node->hlist.first)
+ if (stable_node->hlist.first) {
+ unlock_page(page);
ksm_pages_sharing--;
- else {
- set_page_stable_node(stable_node->page, NULL);
- put_page(stable_node->page);
+ } else {
+ set_page_stable_node(page, NULL);
+ unlock_page(page);
+ put_page(page);
rb_erase(&stable_node->node, &root_stable_tree);
free_stable_node(stable_node);
@@ -710,7 +723,7 @@ static int replace_page(struct vm_area_s
}
get_page(kpage);
- page_add_ksm_rmap(kpage);
+ page_add_anon_rmap(kpage, vma, addr);
flush_cache_page(vma, addr, pte_pfn(*ptep));
ptep_clear_flush(vma, addr, ptep);
@@ -763,8 +776,16 @@ static int try_to_merge_one_page(struct
pages_identical(page, kpage))
err = replace_page(vma, page, kpage, orig_pte);
- if ((vma->vm_flags & VM_LOCKED) && !err)
+ if ((vma->vm_flags & VM_LOCKED) && !err) {
munlock_vma_page(page);
+ if (!PageMlocked(kpage)) {
+ unlock_page(page);
+ lru_add_drain();
+ lock_page(kpage);
+ mlock_vma_page(kpage);
+ page = kpage; /* for final unlock */
+ }
+ }
unlock_page(page);
out:
@@ -841,7 +862,11 @@ static struct page *try_to_merge_two_pag
copy_user_highpage(kpage, page, rmap_item->address, vma);
+ SetPageDirty(kpage);
+ __SetPageUptodate(kpage);
+ SetPageSwapBacked(kpage);
set_page_stable_node(kpage, NULL); /* mark it PageKsm */
+ lru_cache_add_lru(kpage, LRU_ACTIVE_ANON);
err = try_to_merge_one_page(vma, page, kpage);
up:
@@ -1071,7 +1096,9 @@ static void cmp_and_merge_page(struct pa
* The page was successfully merged:
* add its rmap_item to the stable tree.
*/
+ lock_page(kpage);
stable_tree_append(rmap_item, stable_node);
+ unlock_page(kpage);
}
put_page(kpage);
return;
@@ -1112,11 +1139,13 @@ static void cmp_and_merge_page(struct pa
if (kpage) {
remove_rmap_item_from_tree(tree_rmap_item);
+ lock_page(kpage);
stable_node = stable_tree_insert(kpage);
if (stable_node) {
stable_tree_append(tree_rmap_item, stable_node);
stable_tree_append(rmap_item, stable_node);
}
+ unlock_page(kpage);
put_page(kpage);
/*
@@ -1285,14 +1314,6 @@ static void ksm_do_scan(unsigned int sca
return;
if (!PageKsm(page) || !in_stable_tree(rmap_item))
cmp_and_merge_page(page, rmap_item);
- else if (page_mapcount(page) == 1) {
- /*
- * Replace now-unshared ksm page by ordinary page.
- */
- break_cow(rmap_item);
- remove_rmap_item_from_tree(rmap_item);
- rmap_item->oldchecksum = calc_checksum(page);
- }
put_page(page);
}
}
@@ -1337,7 +1358,7 @@ int ksm_madvise(struct vm_area_struct *v
if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE |
VM_PFNMAP | VM_IO | VM_DONTEXPAND |
VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE |
- VM_MIXEDMAP | VM_SAO))
+ VM_NONLINEAR | VM_MIXEDMAP | VM_SAO))
return 0; /* just ignore the advice */
if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) {
@@ -1435,6 +1456,127 @@ void __ksm_exit(struct mm_struct *mm)
}
}
+struct page *ksm_does_need_to_copy(struct page *page,
+ struct vm_area_struct *vma, unsigned long address)
+{
+ struct page *new_page;
+
+ unlock_page(page); /* any racers will COW it, not modify it */
+
+ new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (new_page) {
+ copy_user_highpage(new_page, page, address, vma);
+
+ SetPageDirty(new_page);
+ __SetPageUptodate(new_page);
+ SetPageSwapBacked(new_page);
+ __set_page_locked(new_page);
+
+ if (page_evictable(new_page, vma))
+ lru_cache_add_lru(new_page, LRU_ACTIVE_ANON);
+ else
+ add_page_to_unevictable_list(new_page);
+ }
+
+ page_cache_release(page);
+ return new_page;
+}
+
+int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
+ unsigned long *vm_flags)
+{
+ struct stable_node *stable_node;
+ struct rmap_item *rmap_item;
+ struct hlist_node *hlist;
+ unsigned int mapcount = page_mapcount(page);
+ int referenced = 0;
+ struct vm_area_struct *vma;
+
+ VM_BUG_ON(!PageKsm(page));
+ VM_BUG_ON(!PageLocked(page));
+
+ stable_node = page_stable_node(page);
+ if (!stable_node)
+ return 0;
+
+ /*
+ * Temporary hack: really we need anon_vma in rmap_item, to
+ * provide the correct vma, and to find recently forked instances.
+ * Use zalloc to avoid weirdness if any other fields are involved.
+ */
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_ATOMIC);
+ if (!vma) {
+ spin_lock(&ksm_fallback_vma_lock);
+ vma = &ksm_fallback_vma;
+ }
+
+ hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
+ if (memcg && !mm_match_cgroup(rmap_item->mm, memcg))
+ continue;
+
+ vma->vm_mm = rmap_item->mm;
+ vma->vm_start = rmap_item->address;
+ vma->vm_end = vma->vm_start + PAGE_SIZE;
+
+ referenced += page_referenced_one(page, vma,
+ rmap_item->address, &mapcount, vm_flags);
+ if (!mapcount)
+ goto out;
+ }
+out:
+ if (vma == &ksm_fallback_vma)
+ spin_unlock(&ksm_fallback_vma_lock);
+ else
+ kmem_cache_free(vm_area_cachep, vma);
+ return referenced;
+}
+
+int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
+{
+ struct stable_node *stable_node;
+ struct hlist_node *hlist;
+ struct rmap_item *rmap_item;
+ int ret = SWAP_AGAIN;
+ struct vm_area_struct *vma;
+
+ VM_BUG_ON(!PageKsm(page));
+ VM_BUG_ON(!PageLocked(page));
+
+ stable_node = page_stable_node(page);
+ if (!stable_node)
+ return SWAP_FAIL;
+
+ /*
+ * Temporary hack: really we need anon_vma in rmap_item, to
+ * provide the correct vma, and to find recently forked instances.
+ * Use zalloc to avoid weirdness if any other fields are involved.
+ */
+ if (TTU_ACTION(flags) != TTU_UNMAP)
+ return SWAP_FAIL;
+
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_ATOMIC);
+ if (!vma) {
+ spin_lock(&ksm_fallback_vma_lock);
+ vma = &ksm_fallback_vma;
+ }
+
+ hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
+ vma->vm_mm = rmap_item->mm;
+ vma->vm_start = rmap_item->address;
+ vma->vm_end = vma->vm_start + PAGE_SIZE;
+
+ ret = try_to_unmap_one(page, vma, rmap_item->address, flags);
+ if (ret != SWAP_AGAIN || !page_mapped(page))
+ goto out;
+ }
+out:
+ if (vma == &ksm_fallback_vma)
+ spin_unlock(&ksm_fallback_vma_lock);
+ else
+ kmem_cache_free(vm_area_cachep, vma);
+ return ret;
+}
+
#ifdef CONFIG_SYSFS
/*
* This all compiles without CONFIG_SYSFS, but is a waste of space.
--- ksm1/mm/memory.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm2/mm/memory.c 2009-11-22 20:40:04.000000000 +0000
@@ -2563,6 +2563,12 @@ static int do_swap_page(struct mm_struct
lock_page(page);
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+ page = ksm_might_need_to_copy(page, vma, address);
+ if (!page) {
+ ret = VM_FAULT_OOM;
+ goto out;
+ }
+
if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
ret = VM_FAULT_OOM;
goto out_page;
--- ksm1/mm/rmap.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm2/mm/rmap.c 2009-11-22 20:40:04.000000000 +0000
@@ -49,6 +49,7 @@
#include <linux/swapops.h>
#include <linux/slab.h>
#include <linux/init.h>
+#include <linux/ksm.h>
#include <linux/rmap.h>
#include <linux/rcupdate.h>
#include <linux/module.h>
@@ -336,9 +337,9 @@ int page_mapped_in_vma(struct page *page
* Subfunctions of page_referenced: page_referenced_one called
* repeatedly from either page_referenced_anon or page_referenced_file.
*/
-static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
- unsigned long address, unsigned int *mapcount,
- unsigned long *vm_flags)
+int page_referenced_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address, unsigned int *mapcount,
+ unsigned long *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte;
@@ -507,28 +508,33 @@ int page_referenced(struct page *page,
unsigned long *vm_flags)
{
int referenced = 0;
+ int we_locked = 0;
if (TestClearPageReferenced(page))
referenced++;
*vm_flags = 0;
if (page_mapped(page) && page_rmapping(page)) {
- if (PageAnon(page))
+ if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
+ we_locked = trylock_page(page);
+ if (!we_locked) {
+ referenced++;
+ goto out;
+ }
+ }
+ if (unlikely(PageKsm(page)))
+ referenced += page_referenced_ksm(page, mem_cont,
+ vm_flags);
+ else if (PageAnon(page))
referenced += page_referenced_anon(page, mem_cont,
vm_flags);
- else if (is_locked)
+ else if (page->mapping)
referenced += page_referenced_file(page, mem_cont,
vm_flags);
- else if (!trylock_page(page))
- referenced++;
- else {
- if (page->mapping)
- referenced += page_referenced_file(page,
- mem_cont, vm_flags);
+ if (we_locked)
unlock_page(page);
- }
}
-
+out:
if (page_test_and_clear_young(page))
referenced++;
@@ -620,14 +626,7 @@ static void __page_set_anon_rmap(struct
BUG_ON(!anon_vma);
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
page->mapping = (struct address_space *) anon_vma;
-
page->index = linear_page_index(vma, address);
-
- /*
- * nr_mapped state can be updated without turning off
- * interrupts because it is not modified via interrupt.
- */
- __inc_zone_page_state(page, NR_ANON_PAGES);
}
/**
@@ -665,14 +664,21 @@ static void __page_check_anon_rmap(struc
* @vma: the vm area in which the mapping is added
* @address: the user virtual address mapped
*
- * The caller needs to hold the pte lock and the page must be locked.
+ * The caller needs to hold the pte lock, and the page must be locked in
+ * the anon_vma case: to serialize mapping,index checking after setting.
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
{
+ int first = atomic_inc_and_test(&page->_mapcount);
+ if (first)
+ __inc_zone_page_state(page, NR_ANON_PAGES);
+ if (unlikely(PageKsm(page)))
+ return;
+
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- if (atomic_inc_and_test(&page->_mapcount))
+ if (first)
__page_set_anon_rmap(page, vma, address);
else
__page_check_anon_rmap(page, vma, address);
@@ -694,6 +700,7 @@ void page_add_new_anon_rmap(struct page
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
SetPageSwapBacked(page);
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
+ __inc_zone_page_state(page, NR_ANON_PAGES);
__page_set_anon_rmap(page, vma, address);
if (page_evictable(page, vma))
lru_cache_add_lru(page, LRU_ACTIVE_ANON);
@@ -760,8 +767,8 @@ void page_remove_rmap(struct page *page)
* Subfunctions of try_to_unmap: try_to_unmap_one called
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
-static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
- unsigned long address, enum ttu_flags flags)
+int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address, enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte;
@@ -1152,7 +1159,9 @@ int try_to_unmap(struct page *page, enum
BUG_ON(!PageLocked(page));
- if (PageAnon(page))
+ if (unlikely(PageKsm(page)))
+ ret = try_to_unmap_ksm(page, flags);
+ else if (PageAnon(page))
ret = try_to_unmap_anon(page, flags);
else
ret = try_to_unmap_file(page, flags);
@@ -1173,15 +1182,17 @@ int try_to_unmap(struct page *page, enum
*
* SWAP_AGAIN - no vma is holding page mlocked, or,
* SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem
+ * SWAP_FAIL - page cannot be located at present
* SWAP_MLOCK - page is now mlocked.
*/
int try_to_munlock(struct page *page)
{
VM_BUG_ON(!PageLocked(page) || PageLRU(page));
- if (PageAnon(page))
+ if (unlikely(PageKsm(page)))
+ return try_to_unmap_ksm(page, TTU_MUNLOCK);
+ else if (PageAnon(page))
return try_to_unmap_anon(page, TTU_MUNLOCK);
else
return try_to_unmap_file(page, TTU_MUNLOCK);
}
-
--- ksm1/mm/swapfile.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm2/mm/swapfile.c 2009-11-22 20:40:04.000000000 +0000
@@ -22,6 +22,7 @@
#include <linux/seq_file.h>
#include <linux/init.h>
#include <linux/module.h>
+#include <linux/ksm.h>
#include <linux/rmap.h>
#include <linux/security.h>
#include <linux/backing-dev.h>
@@ -649,6 +650,8 @@ int reuse_swap_page(struct page *page)
int count;
VM_BUG_ON(!PageLocked(page));
+ if (unlikely(PageKsm(page)))
+ return 0;
count = page_mapcount(page);
if (count <= 1 && PageSwapCache(page)) {
count += page_swapcount(page);
@@ -657,7 +660,7 @@ int reuse_swap_page(struct page *page)
SetPageDirty(page);
}
}
- return count == 1;
+ return count <= 1;
}
/*
@@ -1184,6 +1187,12 @@ static int try_to_unuse(unsigned int typ
* read from disk into another page. Splitting into two
* pages would be incorrect if swap supported "shared
* private" pages, but they are handled by tmpfs files.
+ *
+ * Given how unuse_vma() targets one particular offset
+ * in an anon_vma, once the anon_vma has been determined,
+ * this splitting happens to be just what is needed to
+ * handle where KSM pages have been swapped out: re-reading
+ * is unnecessarily slow, but we can fix that later on.
*/
if (swap_count(*swap_map) &&
PageDirty(page) && PageSwapCache(page)) {
For full functionality, page_referenced_one() and try_to_unmap_one()
need to know the vma: to pass vma down to arch-dependent flushes,
or to observe VM_LOCKED or VM_EXEC. But KSM keeps no record of vma:
nor can it, since vmas get split and merged without its knowledge.
Instead, note page's anon_vma in its rmap_item when adding to stable
tree: all the vmas which might map that page are listed by its anon_vma.
page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
first to find the probable vma, that which matches rmap_item's mm; but
if that is not enough to locate all instances, traverse again to try the
others. This catches those occasions when fork has duplicated a pte of
a ksm page, but ksmd has not yet come around to assign it an rmap_item.
But each rmap_item in the stable tree which refers to an anon_vma needs
to take a reference to it. Andrea's anon_vma design cleverly avoided a
reference count (an anon_vma was freed as soon as its list of vmas was empty),
but KSM now needs to add one. Is a 32-bit count sufficient? I believe
so - the anon_vma is only freed when both count is 0 and list is empty.
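To spell out that freeing rule in sketch form (both hunks appear in full
below), each side rechecks the other's emptiness under the anon_vma lock
before freeing:

	/* mm/rmap.c, anon_vma_unlink(): the vma list empties */
	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
	spin_unlock(&anon_vma->lock);
	if (empty)
		anon_vma_free(anon_vma);

	/* mm/ksm.c, drop_anon_vma(): the last ksm reference goes */
	if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
		int empty = list_empty(&anon_vma->head);
		spin_unlock(&anon_vma->lock);
		if (empty)
			anon_vma_free(anon_vma);
	}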
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/rmap.h | 24 ++++++
mm/ksm.c | 153 ++++++++++++++++++++++++-----------------
mm/rmap.c | 5 -
3 files changed, 120 insertions(+), 62 deletions(-)
--- ksm2/include/linux/rmap.h 2009-11-22 20:40:04.000000000 +0000
+++ ksm3/include/linux/rmap.h 2009-11-22 20:40:11.000000000 +0000
@@ -26,6 +26,9 @@
*/
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
+#ifdef CONFIG_KSM
+ atomic_t ksm_refcount;
+#endif
/*
* NOTE: the LSB of the head.next is set by
* mm_take_all_locks() _after_ taking the above lock. So the
@@ -38,6 +41,26 @@ struct anon_vma {
};
#ifdef CONFIG_MMU
+#ifdef CONFIG_KSM
+static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+{
+ atomic_set(&anon_vma->ksm_refcount, 0);
+}
+
+static inline int ksm_refcount(struct anon_vma *anon_vma)
+{
+ return atomic_read(&anon_vma->ksm_refcount);
+}
+#else
+static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int ksm_refcount(struct anon_vma *anon_vma)
+{
+ return 0;
+}
+#endif /* CONFIG_KSM */
static inline struct anon_vma *page_anon_vma(struct page *page)
{
@@ -70,6 +93,7 @@ void __anon_vma_merge(struct vm_area_str
void anon_vma_unlink(struct vm_area_struct *);
void anon_vma_link(struct vm_area_struct *);
void __anon_vma_link(struct vm_area_struct *);
+void anon_vma_free(struct anon_vma *);
/*
* rmap interfaces called when adding or removing pte of page
--- ksm2/mm/ksm.c 2009-11-22 20:40:04.000000000 +0000
+++ ksm3/mm/ksm.c 2009-11-22 20:40:11.000000000 +0000
@@ -121,7 +121,7 @@ struct stable_node {
/**
* struct rmap_item - reverse mapping item for virtual addresses
* @rmap_list: next rmap_item in mm_slot's singly-linked rmap_list
- * @filler: unused space we're making available in this patch
+ * @anon_vma: pointer to anon_vma for this mm,address, when in stable tree
* @mm: the memory structure this rmap_item is pointing into
* @address: the virtual address this rmap_item tracks (+ flags in low bits)
* @oldchecksum: previous checksum of the page at that virtual address
@@ -131,7 +131,7 @@ struct stable_node {
*/
struct rmap_item {
struct rmap_item *rmap_list;
- unsigned long filler;
+ struct anon_vma *anon_vma; /* when stable */
struct mm_struct *mm;
unsigned long address; /* + low bits used for flags below */
unsigned int oldchecksum; /* when unstable */
@@ -196,13 +196,6 @@ static DECLARE_WAIT_QUEUE_HEAD(ksm_threa
static DEFINE_MUTEX(ksm_thread_mutex);
static DEFINE_SPINLOCK(ksm_mmlist_lock);
-/*
- * Temporary hack for page_referenced_ksm() and try_to_unmap_ksm(),
- * later we rework things a little to get the right vma to them.
- */
-static DEFINE_SPINLOCK(ksm_fallback_vma_lock);
-static struct vm_area_struct ksm_fallback_vma;
-
#define KSM_KMEM_CACHE(__struct, __flags) kmem_cache_create("ksm_"#__struct,\
sizeof(struct __struct), __alignof__(struct __struct),\
(__flags), NULL)
@@ -323,6 +316,25 @@ static inline int in_stable_tree(struct
return rmap_item->address & STABLE_FLAG;
}
+static void hold_anon_vma(struct rmap_item *rmap_item,
+ struct anon_vma *anon_vma)
+{
+ rmap_item->anon_vma = anon_vma;
+ atomic_inc(&anon_vma->ksm_refcount);
+}
+
+static void drop_anon_vma(struct rmap_item *rmap_item)
+{
+ struct anon_vma *anon_vma = rmap_item->anon_vma;
+
+ if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
+ int empty = list_empty(&anon_vma->head);
+ spin_unlock(&anon_vma->lock);
+ if (empty)
+ anon_vma_free(anon_vma);
+ }
+}
+
/*
* ksmd, and unmerge_and_remove_all_rmap_items(), must not touch an mm's
* page tables after it has passed through ksm_exit() - which, if necessary,
@@ -472,6 +484,7 @@ static void remove_rmap_item_from_tree(s
ksm_pages_shared--;
}
+ drop_anon_vma(rmap_item);
rmap_item->address &= PAGE_MASK;
} else if (rmap_item->address & UNSTABLE_FLAG) {
@@ -752,6 +765,9 @@ static int try_to_merge_one_page(struct
pte_t orig_pte = __pte(0);
int err = -EFAULT;
+ if (page == kpage) /* ksm page forked */
+ return 0;
+
if (!(vma->vm_flags & VM_MERGEABLE))
goto out;
if (!PageAnon(page))
@@ -805,9 +821,6 @@ static int try_to_merge_with_ksm_page(st
struct vm_area_struct *vma;
int err = -EFAULT;
- if (page == kpage) /* ksm page forked */
- return 0;
-
down_read(&mm->mmap_sem);
if (ksm_test_exit(mm))
goto out;
@@ -816,6 +829,11 @@ static int try_to_merge_with_ksm_page(st
goto out;
err = try_to_merge_one_page(vma, page, kpage);
+ if (err)
+ goto out;
+
+ /* Must get reference to anon_vma while still holding mmap_sem */
+ hold_anon_vma(rmap_item, vma->anon_vma);
out:
up_read(&mm->mmap_sem);
return err;
@@ -869,6 +887,11 @@ static struct page *try_to_merge_two_pag
lru_cache_add_lru(kpage, LRU_ACTIVE_ANON);
err = try_to_merge_one_page(vma, page, kpage);
+ if (err)
+ goto up;
+
+ /* Must get reference to anon_vma while still holding mmap_sem */
+ hold_anon_vma(rmap_item, vma->anon_vma);
up:
up_read(&mm->mmap_sem);
@@ -879,8 +902,10 @@ up:
* If that fails, we have a ksm page with only one pte
* pointing to it: so break it.
*/
- if (err)
+ if (err) {
+ drop_anon_vma(rmap_item);
break_cow(rmap_item);
+ }
}
if (err) {
put_page(kpage);
@@ -1155,7 +1180,9 @@ static void cmp_and_merge_page(struct pa
* in which case we need to break_cow on both.
*/
if (!stable_node) {
+ drop_anon_vma(tree_rmap_item);
break_cow(tree_rmap_item);
+ drop_anon_vma(rmap_item);
break_cow(rmap_item);
}
}
@@ -1490,7 +1517,7 @@ int page_referenced_ksm(struct page *pag
struct hlist_node *hlist;
unsigned int mapcount = page_mapcount(page);
int referenced = 0;
- struct vm_area_struct *vma;
+ int search_new_forks = 0;
VM_BUG_ON(!PageKsm(page));
VM_BUG_ON(!PageLocked(page));
@@ -1498,36 +1525,40 @@ int page_referenced_ksm(struct page *pag
stable_node = page_stable_node(page);
if (!stable_node)
return 0;
-
- /*
- * Temporary hack: really we need anon_vma in rmap_item, to
- * provide the correct vma, and to find recently forked instances.
- * Use zalloc to avoid weirdness if any other fields are involved.
- */
- vma = kmem_cache_zalloc(vm_area_cachep, GFP_ATOMIC);
- if (!vma) {
- spin_lock(&ksm_fallback_vma_lock);
- vma = &ksm_fallback_vma;
- }
-
+again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
+ struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct vm_area_struct *vma;
+
if (memcg && !mm_match_cgroup(rmap_item->mm, memcg))
continue;
- vma->vm_mm = rmap_item->mm;
- vma->vm_start = rmap_item->address;
- vma->vm_end = vma->vm_start + PAGE_SIZE;
+ spin_lock(&anon_vma->lock);
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ if (rmap_item->address < vma->vm_start ||
+ rmap_item->address >= vma->vm_end)
+ continue;
+ /*
+ * Initially we examine only the vma which covers this
+ * rmap_item; but later, if there is still work to do,
+ * we examine covering vmas in other mms: in case they
+ * were forked from the original since ksmd passed.
+ */
+ if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
+ continue;
- referenced += page_referenced_one(page, vma,
+ referenced += page_referenced_one(page, vma,
rmap_item->address, &mapcount, vm_flags);
+ if (!search_new_forks || !mapcount)
+ break;
+ }
+ spin_unlock(&anon_vma->lock);
if (!mapcount)
goto out;
}
+ if (!search_new_forks++)
+ goto again;
out:
- if (vma == &ksm_fallback_vma)
- spin_unlock(&ksm_fallback_vma_lock);
- else
- kmem_cache_free(vm_area_cachep, vma);
return referenced;
}
@@ -1537,7 +1568,7 @@ int try_to_unmap_ksm(struct page *page,
struct hlist_node *hlist;
struct rmap_item *rmap_item;
int ret = SWAP_AGAIN;
- struct vm_area_struct *vma;
+ int search_new_forks = 0;
VM_BUG_ON(!PageKsm(page));
VM_BUG_ON(!PageLocked(page));
@@ -1545,35 +1576,37 @@ int try_to_unmap_ksm(struct page *page,
stable_node = page_stable_node(page);
if (!stable_node)
return SWAP_FAIL;
-
- /*
- * Temporary hack: really we need anon_vma in rmap_item, to
- * provide the correct vma, and to find recently forked instances.
- * Use zalloc to avoid weirdness if any other fields are involved.
- */
- if (TTU_ACTION(flags) != TTU_UNMAP)
- return SWAP_FAIL;
-
- vma = kmem_cache_zalloc(vm_area_cachep, GFP_ATOMIC);
- if (!vma) {
- spin_lock(&ksm_fallback_vma_lock);
- vma = &ksm_fallback_vma;
- }
-
+again:
hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
- vma->vm_mm = rmap_item->mm;
- vma->vm_start = rmap_item->address;
- vma->vm_end = vma->vm_start + PAGE_SIZE;
+ struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct vm_area_struct *vma;
- ret = try_to_unmap_one(page, vma, rmap_item->address, flags);
- if (ret != SWAP_AGAIN || !page_mapped(page))
- goto out;
+ spin_lock(&anon_vma->lock);
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ if (rmap_item->address < vma->vm_start ||
+ rmap_item->address >= vma->vm_end)
+ continue;
+ /*
+ * Initially we examine only the vma which covers this
+ * rmap_item; but later, if there is still work to do,
+ * we examine covering vmas in other mms: in case they
+ * were forked from the original since ksmd passed.
+ */
+ if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
+ continue;
+
+ ret = try_to_unmap_one(page, vma,
+ rmap_item->address, flags);
+ if (ret != SWAP_AGAIN || !page_mapped(page)) {
+ spin_unlock(&anon_vma->lock);
+ goto out;
+ }
+ }
+ spin_unlock(&anon_vma->lock);
}
+ if (!search_new_forks++)
+ goto again;
out:
- if (vma == &ksm_fallback_vma)
- spin_unlock(&ksm_fallback_vma_lock);
- else
- kmem_cache_free(vm_area_cachep, vma);
return ret;
}
--- ksm2/mm/rmap.c 2009-11-22 20:40:04.000000000 +0000
+++ ksm3/mm/rmap.c 2009-11-22 20:40:11.000000000 +0000
@@ -68,7 +68,7 @@ static inline struct anon_vma *anon_vma_
return kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
}
-static inline void anon_vma_free(struct anon_vma *anon_vma)
+void anon_vma_free(struct anon_vma *anon_vma)
{
kmem_cache_free(anon_vma_cachep, anon_vma);
}
@@ -172,7 +172,7 @@ void anon_vma_unlink(struct vm_area_stru
list_del(&vma->anon_vma_node);
/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head);
+ empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
spin_unlock(&anon_vma->lock);
if (empty)
@@ -184,6 +184,7 @@ static void anon_vma_ctor(void *data)
struct anon_vma *anon_vma = data;
spin_lock_init(&anon_vma->lock);
+ ksm_refcount_init(anon_vma);
INIT_LIST_HEAD(&anon_vma->head);
}
There's a lamentable flaw in KSM swapping: the stable_node holds a
reference to the ksm page, so the page to be freed cannot actually be
freed until ksmd works its way around to removing the last rmap_item
from its stable_node. Which in some configurations may take minutes:
not quite responsive enough for memory reclaim. And we don't want to
twist KSM and its locking more tightly into the rest of mm. What a pity.
But although the stable_node needs to hold a pointer to the ksm page,
does it actually need to raise the reference count of that page?
No. It would need to do so if struct pages were ordinary kmalloc'ed
objects; but they are more stable than that, and reused in particular
ways according to particular rules.
Access to stable_node from its pointer in struct page is no problem, so
long as we never free a stable_node before the ksm page itself has been
freed. Access to struct page from its pointer in stable_node: reintroduce
get_ksm_page(), and let that peep out through its keyhole (the stable_node
pointer to ksm page), to see if that struct page still holds the right key
to open it (the ksm page mapping pointer back to this stable_node).
This relies upon the established way in which free_hot_cold_page() sets
an anon (including ksm) page->mapping to NULL; and relies upon no other
user of a struct page to put something which looks like the original
stable_node pointer (with two low bits also set) into page->mapping.
It also needs get_page_unless_zero() technique pioneered by speculative
pagecache; and uses rcu_read_lock() to keep the guarantees that gives.
There are several drivers which put pointers of their own into page->
mapping; but none of those could coincide with our stable_node pointers,
since KSM won't free a stable_node until it sees that the page has gone.
The only problem case found is the pagetable spinlock USE_SPLIT_PTLOCKS
places in struct page (my own abuse): to accommodate GENERIC_LOCKBREAK's
break_lock on 32-bit, that spans both page->private and page->mapping.
Since break_lock is only 0 or 1, again no confusion for get_ksm_page().
But what of DEBUG_SPINLOCK on 64-bit bigendian? When owner_cpu is 3
(matching PageKsm low bits), it might see 0xdead4ead00000003 in page->
mapping, which might coincide? We could get around that by... but a
better answer is to suppress USE_SPLIT_PTLOCKS when DEBUG_SPINLOCK or
DEBUG_LOCK_ALLOC, to stop bloating sizeof(struct page) in their case -
already proposed in an earlier mm/Kconfig patch.
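Condensed from the get_ksm_page() added below, the keyhole check amounts to:
compute what page->mapping must still look like if the page belongs to this
stable_node (the pointer with the same two low bits that PageKsm() tests),
take a speculative reference, then look again.

	void *expected_mapping = (void *)stable_node +
				 (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);

	if (page->mapping != expected_mapping)
		goto stale;		/* page freed or reused already */
	if (!get_page_unless_zero(page))
		goto stale;		/* it was on its way to being freed */
	if (page->mapping != expected_mapping) {
		put_page(page);		/* freed and reused in the meantime */
		goto stale;
	}
	/* we now hold a reference to a page still keyed to this stable_node */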
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/ksm.c | 149 +++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 110 insertions(+), 39 deletions(-)
--- ksm3/mm/ksm.c 2009-11-22 20:40:11.000000000 +0000
+++ ksm4/mm/ksm.c 2009-11-22 20:40:18.000000000 +0000
@@ -413,6 +413,12 @@ static void break_cow(struct rmap_item *
unsigned long addr = rmap_item->address;
struct vm_area_struct *vma;
+ /*
+ * It is not an accident that whenever we want to break COW
+ * to undo, we also need to drop a reference to the anon_vma.
+ */
+ drop_anon_vma(rmap_item);
+
down_read(&mm->mmap_sem);
if (ksm_test_exit(mm))
goto out;
@@ -456,6 +462,79 @@ out: page = NULL;
return page;
}
+static void remove_node_from_stable_tree(struct stable_node *stable_node)
+{
+ struct rmap_item *rmap_item;
+ struct hlist_node *hlist;
+
+ hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
+ if (rmap_item->hlist.next)
+ ksm_pages_sharing--;
+ else
+ ksm_pages_shared--;
+ drop_anon_vma(rmap_item);
+ rmap_item->address &= PAGE_MASK;
+ cond_resched();
+ }
+
+ rb_erase(&stable_node->node, &root_stable_tree);
+ free_stable_node(stable_node);
+}
+
+/*
+ * get_ksm_page: checks if the page indicated by the stable node
+ * is still its ksm page, despite having held no reference to it.
+ * In which case we can trust the content of the page, and it
+ * returns the gotten page; but if the page has now been zapped,
+ * remove the stale node from the stable tree and return NULL.
+ *
+ * You would expect the stable_node to hold a reference to the ksm page.
+ * But if it increments the page's count, swapping out has to wait for
+ * ksmd to come around again before it can free the page, which may take
+ * seconds or even minutes: much too unresponsive. So instead we use a
+ * "keyhole reference": access to the ksm page from the stable node peeps
+ * out through its keyhole to see if that page still holds the right key,
+ * pointing back to this stable node. This relies on freeing a PageAnon
+ * page to reset its page->mapping to NULL, and relies on no other use of
+ * a page to put something that might look like our key in page->mapping.
+ *
+ * include/linux/pagemap.h page_cache_get_speculative() is a good reference,
+ * but this is different - made simpler by ksm_thread_mutex being held, but
+ * interesting for assuming that no other use of the struct page could ever
+ * put our expected_mapping into page->mapping (or a field of the union which
+ * coincides with page->mapping). The RCU calls are not for KSM at all, but
+ * to keep the page_count protocol described with page_cache_get_speculative.
+ *
+ * Note: it is possible that get_ksm_page() will return NULL one moment,
+ * then page the next, if the page is in between page_freeze_refs() and
+ * page_unfreeze_refs(): this shouldn't be a problem anywhere, the page
+ * is on its way to being freed; but it is an anomaly to bear in mind.
+ */
+static struct page *get_ksm_page(struct stable_node *stable_node)
+{
+ struct page *page;
+ void *expected_mapping;
+
+ page = stable_node->page;
+ expected_mapping = (void *)stable_node +
+ (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+ rcu_read_lock();
+ if (page->mapping != expected_mapping)
+ goto stale;
+ if (!get_page_unless_zero(page))
+ goto stale;
+ if (page->mapping != expected_mapping) {
+ put_page(page);
+ goto stale;
+ }
+ rcu_read_unlock();
+ return page;
+stale:
+ rcu_read_unlock();
+ remove_node_from_stable_tree(stable_node);
+ return NULL;
+}
+
/*
* Removing rmap_item from stable or unstable tree.
* This function will clean the information from the stable/unstable tree.
@@ -467,22 +546,19 @@ static void remove_rmap_item_from_tree(s
struct page *page;
stable_node = rmap_item->head;
- page = stable_node->page;
- lock_page(page);
+ page = get_ksm_page(stable_node);
+ if (!page)
+ goto out;
+ lock_page(page);
hlist_del(&rmap_item->hlist);
- if (stable_node->hlist.first) {
- unlock_page(page);
- ksm_pages_sharing--;
- } else {
- set_page_stable_node(page, NULL);
- unlock_page(page);
- put_page(page);
+ unlock_page(page);
+ put_page(page);
- rb_erase(&stable_node->node, &root_stable_tree);
- free_stable_node(stable_node);
+ if (stable_node->hlist.first)
+ ksm_pages_sharing--;
+ else
ksm_pages_shared--;
- }
drop_anon_vma(rmap_item);
rmap_item->address &= PAGE_MASK;
@@ -504,7 +580,7 @@ static void remove_rmap_item_from_tree(s
ksm_pages_unshared--;
rmap_item->address &= PAGE_MASK;
}
-
+out:
cond_resched(); /* we're called from many long loops */
}
@@ -902,10 +978,8 @@ up:
* If that fails, we have a ksm page with only one pte
* pointing to it: so break it.
*/
- if (err) {
- drop_anon_vma(rmap_item);
+ if (err)
break_cow(rmap_item);
- }
}
if (err) {
put_page(kpage);
@@ -935,21 +1009,25 @@ static struct stable_node *stable_tree_s
}
while (node) {
+ struct page *tree_page;
int ret;
cond_resched();
stable_node = rb_entry(node, struct stable_node, node);
+ tree_page = get_ksm_page(stable_node);
+ if (!tree_page)
+ return NULL;
- ret = memcmp_pages(page, stable_node->page);
+ ret = memcmp_pages(page, tree_page);
- if (ret < 0)
+ if (ret < 0) {
+ put_page(tree_page);
node = node->rb_left;
- else if (ret > 0)
+ } else if (ret > 0) {
+ put_page(tree_page);
node = node->rb_right;
- else {
- get_page(stable_node->page);
+ } else
return stable_node;
- }
}
return NULL;
@@ -969,12 +1047,17 @@ static struct stable_node *stable_tree_i
struct stable_node *stable_node;
while (*new) {
+ struct page *tree_page;
int ret;
cond_resched();
stable_node = rb_entry(*new, struct stable_node, node);
+ tree_page = get_ksm_page(stable_node);
+ if (!tree_page)
+ return NULL;
- ret = memcmp_pages(kpage, stable_node->page);
+ ret = memcmp_pages(kpage, tree_page);
+ put_page(tree_page);
parent = *new;
if (ret < 0)
@@ -1000,7 +1083,6 @@ static struct stable_node *stable_tree_i
INIT_HLIST_HEAD(&stable_node->hlist);
- get_page(kpage);
stable_node->page = kpage;
set_page_stable_node(kpage, stable_node);
@@ -1130,19 +1212,10 @@ static void cmp_and_merge_page(struct pa
}
/*
- * A ksm page might have got here by fork, but its other
- * references have already been removed from the stable tree.
- * Or it might be left over from a break_ksm which failed
- * when the mem_cgroup had reached its limit: try again now.
- */
- if (PageKsm(page))
- break_cow(rmap_item);
-
- /*
- * In case the hash value of the page was changed from the last time we
- * have calculated it, this page to be changed frequely, therefore we
- * don't want to insert it to the unstable tree, and we don't want to
- * waste our time to search if there is something identical to it there.
+ * If the hash value of the page has changed from the last time
+ * we calculated it, this page is changing frequently: therefore we
+ * don't want to insert it in the unstable tree, and we don't want
+ * to waste our time searching for something identical to it there.
*/
checksum = calc_checksum(page);
if (rmap_item->oldchecksum != checksum) {
@@ -1180,9 +1253,7 @@ static void cmp_and_merge_page(struct pa
* in which case we need to break_cow on both.
*/
if (!stable_node) {
- drop_anon_vma(tree_rmap_item);
break_cow(tree_rmap_item);
- drop_anon_vma(rmap_item);
break_cow(rmap_item);
}
}
When ksm pages were unswappable, it made no sense to include them in
mem cgroup accounting; but now that they are swappable (although I see
no strict logical connection) the principle of least surprise implies
that they should be accounted (with the usual dissatisfaction, that a
shared page is accounted to only one of the cgroups using it).
This patch was intended to add mem cgroup accounting where necessary;
but turned inside out, it now avoids allocating a ksm page, instead
upgrading an anon page to ksm - which brings its existing mem cgroup
accounting with it. Thus mem cgroups don't appear in the patch at all.
This upgrade from PageAnon to PageKsm takes place under page lock
(via a somewhat hacky NULL kpage interface), and audit showed only
one place which needed to cope with the race - page_referenced() is
sometimes used without page lock, so page_lock_anon_vma() needs an
ACCESS_ONCE() to be sure of getting anon_vma and flags together
(no problem if the page goes ksm an instant after, the integrity
of that anon_vma list is unaffected).
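Condensed from the mm/rmap.c hunk below: page_lock_anon_vma() must test the
mapping flags and later extract the anon_vma pointer from one and the same
read of page->mapping, hence the ACCESS_ONCE(); otherwise the compiler is
free to reload it and see the PageAnon-to-PageKsm upgrade in between.

	rcu_read_lock();
	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;	/* not anon, or perhaps just gone ksm */
	if (!page_mapped(page))
		goto out;
	/*
	 * The anon_vma pointer is then derived from this same snapshot,
	 * so the flags tested and the pointer used cannot disagree.
	 */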
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/ksm.c | 67 ++++++++++++++++------------------------------------
mm/rmap.c | 6 +++-
2 files changed, 25 insertions(+), 48 deletions(-)
--- ksm4/mm/ksm.c 2009-11-22 20:40:18.000000000 +0000
+++ ksm5/mm/ksm.c 2009-11-22 20:40:27.000000000 +0000
@@ -831,7 +831,8 @@ out:
* try_to_merge_one_page - take two pages and merge them into one
* @vma: the vma that holds the pte pointing to page
* @page: the PageAnon page that we want to replace with kpage
- * @kpage: the PageKsm page that we want to map instead of page
+ * @kpage: the PageKsm page that we want to map instead of page,
+ * or NULL the first time when we want to use page as kpage.
*
* This function returns 0 if the pages were merged, -EFAULT otherwise.
*/
@@ -864,15 +865,24 @@ static int try_to_merge_one_page(struct
* ptes are necessarily already write-protected. But in either
* case, we need to lock and check page_count is not raised.
*/
- if (write_protect_page(vma, page, &orig_pte) == 0 &&
- pages_identical(page, kpage))
- err = replace_page(vma, page, kpage, orig_pte);
+ if (write_protect_page(vma, page, &orig_pte) == 0) {
+ if (!kpage) {
+ /*
+ * While we hold page lock, upgrade page from
+ * PageAnon+anon_vma to PageKsm+NULL stable_node:
+ * stable_tree_insert() will update stable_node.
+ */
+ set_page_stable_node(page, NULL);
+ mark_page_accessed(page);
+ err = 0;
+ } else if (pages_identical(page, kpage))
+ err = replace_page(vma, page, kpage, orig_pte);
+ }
- if ((vma->vm_flags & VM_LOCKED) && !err) {
+ if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
munlock_vma_page(page);
if (!PageMlocked(kpage)) {
unlock_page(page);
- lru_add_drain();
lock_page(kpage);
mlock_vma_page(kpage);
page = kpage; /* for final unlock */
@@ -922,7 +932,7 @@ out:
* This function returns the kpage if we successfully merged two identical
* pages into one ksm page, NULL otherwise.
*
- * Note that this function allocates a new kernel page: if one of the pages
+ * Note that this function upgrades page to ksm page: if one of the pages
* is already a ksm page, try_to_merge_with_ksm_page should be used.
*/
static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
@@ -930,10 +940,7 @@ static struct page *try_to_merge_two_pag
struct rmap_item *tree_rmap_item,
struct page *tree_page)
{
- struct mm_struct *mm = rmap_item->mm;
- struct vm_area_struct *vma;
- struct page *kpage;
- int err = -EFAULT;
+ int err;
/*
* The number of nodes in the stable tree
@@ -943,37 +950,10 @@ static struct page *try_to_merge_two_pag
ksm_max_kernel_pages <= ksm_pages_shared)
return NULL;
- kpage = alloc_page(GFP_HIGHUSER);
- if (!kpage)
- return NULL;
-
- down_read(&mm->mmap_sem);
- if (ksm_test_exit(mm))
- goto up;
- vma = find_vma(mm, rmap_item->address);
- if (!vma || vma->vm_start > rmap_item->address)
- goto up;
-
- copy_user_highpage(kpage, page, rmap_item->address, vma);
-
- SetPageDirty(kpage);
- __SetPageUptodate(kpage);
- SetPageSwapBacked(kpage);
- set_page_stable_node(kpage, NULL); /* mark it PageKsm */
- lru_cache_add_lru(kpage, LRU_ACTIVE_ANON);
-
- err = try_to_merge_one_page(vma, page, kpage);
- if (err)
- goto up;
-
- /* Must get reference to anon_vma while still holding mmap_sem */
- hold_anon_vma(rmap_item, vma->anon_vma);
-up:
- up_read(&mm->mmap_sem);
-
+ err = try_to_merge_with_ksm_page(rmap_item, page, NULL);
if (!err) {
err = try_to_merge_with_ksm_page(tree_rmap_item,
- tree_page, kpage);
+ tree_page, page);
/*
* If that fails, we have a ksm page with only one pte
* pointing to it: so break it.
@@ -981,11 +961,7 @@ up:
if (err)
break_cow(rmap_item);
}
- if (err) {
- put_page(kpage);
- kpage = NULL;
- }
- return kpage;
+ return err ? NULL : page;
}
/*
@@ -1244,7 +1220,6 @@ static void cmp_and_merge_page(struct pa
stable_tree_append(rmap_item, stable_node);
}
unlock_page(kpage);
- put_page(kpage);
/*
* If we fail to insert the page into the stable tree,
--- ksm4/mm/rmap.c 2009-11-22 20:40:11.000000000 +0000
+++ ksm5/mm/rmap.c 2009-11-22 20:40:27.000000000 +0000
@@ -204,7 +204,7 @@ struct anon_vma *page_lock_anon_vma(stru
unsigned long anon_mapping;
rcu_read_lock();
- anon_mapping = (unsigned long) page->mapping;
+ anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
goto out;
if (!page_mapped(page))
@@ -666,7 +666,9 @@ static void __page_check_anon_rmap(struc
* @address: the user virtual address mapped
*
* The caller needs to hold the pte lock, and the page must be locked in
- * the anon_vma case: to serialize mapping,index checking after setting.
+ * the anon_vma case: to serialize mapping,index checking after setting,
+ * and to ensure that PageAnon is not being upgraded racily to PageKsm
+ * (but PageKsm is never downgraded to PageAnon).
*/
void page_add_anon_rmap(struct page *page,
struct vm_area_struct *vma, unsigned long address)
But ksm swapping does require one small change in mem cgroup handling.
When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
substitute a duplicate page to accommodate a different anon_vma (or a
different index), that page escaped mem cgroup accounting, because of
the !PageSwapCache check in mem_cgroup_try_charge_swapin().
That was returning success without charging, on the assumption that
pte_same() would fail after, which is not the case here. Originally I
proposed that success, so that an unshrinkable mem cgroup at its limit
would not fail unnecessarily; but that's a minor point, and there are
plenty of other places where we may fail an overallocation which might
later prove unnecessary. So just go ahead and do what all the other
exceptions do: proceed to charge current mm.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/memcontrol.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
--- ksm5/mm/memcontrol.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm6/mm/memcontrol.c 2009-11-22 20:40:37.000000000 +0000
@@ -1862,11 +1862,12 @@ int mem_cgroup_try_charge_swapin(struct
goto charge_cur_mm;
/*
* A racing thread's fault, or swapoff, may have already updated
- * the pte, and even removed page from swap cache: return success
- * to go on to do_swap_page()'s pte_same() test, which should fail.
+ * the pte, and even removed page from swap cache: in those cases
+ * do_swap_page()'s pte_same() test will fail; but there's also a
+ * KSM case which does need to charge the page.
*/
if (!PageSwapCache(page))
- return 0;
+ goto charge_cur_mm;
mem = try_get_mem_cgroup_from_swapcache(page);
if (!mem)
goto charge_cur_mm;
A side-effect of making ksm pages swappable is that they have to be
placed on the LRUs: which then exposes them to isolate_lru_page() and
hence to page migration.
Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon()
and rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps
some consolidation with existing code is possible, but don't attempt
that yet (try_to_unmap needs to handle nonlinears, but migration pte
removal does not).
rmap_walk() is sadly less general than it appears: rmap_walk_anon(),
like remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA
page migration was introduced (holding mmap_sem provided the missing
guarantee that anon_vma's slab had not already been destroyed), but
I believe not valid in the memory hotremove case added since.
For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk()
on hwpoisoned ksm pages too: for now, they remain among hwpoison's
various exceptions (its PageKsm test comes before the page is locked,
but its page_lock_anon_vma fails safely if an anon gets upgraded).
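For orientation (the tail of the mm/rmap.c hunk is truncated here), the new
entry point ends up as a simple dispatcher, mirroring try_to_unmap(): ksm
pages go to ksm.c, the rest split between the anon and file walkers. A sketch:

	int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
		struct vm_area_struct *, unsigned long, void *), void *arg)
	{
		VM_BUG_ON(!PageLocked(page));

		if (unlikely(PageKsm(page)))
			return rmap_walk_ksm(page, rmap_one, arg);
		else if (PageAnon(page))
			return rmap_walk_anon(page, rmap_one, arg);
		else
			return rmap_walk_file(page, rmap_one, arg);
	}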
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/ksm.h | 13 ++++++
include/linux/rmap.h | 6 ++
mm/ksm.c | 65 +++++++++++++++++++++++++++++++
mm/migrate.c | 85 ++++++++---------------------------------
mm/rmap.c | 79 ++++++++++++++++++++++++++++++++++++++
5 files changed, 181 insertions(+), 67 deletions(-)
--- ksm6/include/linux/ksm.h 2009-11-22 20:40:04.000000000 +0000
+++ ksm7/include/linux/ksm.h 2009-11-22 20:40:46.000000000 +0000
@@ -88,6 +88,9 @@ static inline struct page *ksm_might_nee
int page_referenced_ksm(struct page *page,
struct mem_cgroup *memcg, unsigned long *vm_flags);
int try_to_unmap_ksm(struct page *page, enum ttu_flags flags);
+int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
+ struct vm_area_struct *, unsigned long, void *), void *arg);
+void ksm_migrate_page(struct page *newpage, struct page *oldpage);
#else /* !CONFIG_KSM */
@@ -127,6 +130,16 @@ static inline int try_to_unmap_ksm(struc
{
return 0;
}
+
+static inline int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page*,
+ struct vm_area_struct *, unsigned long, void *), void *arg)
+{
+ return 0;
+}
+
+static inline void ksm_migrate_page(struct page *newpage, struct page *oldpage)
+{
+}
#endif /* !CONFIG_KSM */
#endif /* __LINUX_KSM_H */
--- ksm6/include/linux/rmap.h 2009-11-22 20:40:11.000000000 +0000
+++ ksm7/include/linux/rmap.h 2009-11-22 20:40:46.000000000 +0000
@@ -164,6 +164,12 @@ struct anon_vma *page_lock_anon_vma(stru
void page_unlock_anon_vma(struct anon_vma *anon_vma);
int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
+/*
+ * Called by migrate.c to remove migration ptes, but might be used more later.
+ */
+int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
+ struct vm_area_struct *, unsigned long, void *), void *arg);
+
#else /* !CONFIG_MMU */
#define anon_vma_init() do {} while (0)
--- ksm6/mm/ksm.c 2009-11-22 20:40:27.000000000 +0000
+++ ksm7/mm/ksm.c 2009-11-22 20:40:46.000000000 +0000
@@ -1656,6 +1656,71 @@ out:
return ret;
}
+#ifdef CONFIG_MIGRATION
+int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
+ struct vm_area_struct *, unsigned long, void *), void *arg)
+{
+ struct stable_node *stable_node;
+ struct hlist_node *hlist;
+ struct rmap_item *rmap_item;
+ int ret = SWAP_AGAIN;
+ int search_new_forks = 0;
+
+ VM_BUG_ON(!PageKsm(page));
+ VM_BUG_ON(!PageLocked(page));
+
+ stable_node = page_stable_node(page);
+ if (!stable_node)
+ return ret;
+again:
+ hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
+ struct anon_vma *anon_vma = rmap_item->anon_vma;
+ struct vm_area_struct *vma;
+
+ spin_lock(&anon_vma->lock);
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ if (rmap_item->address < vma->vm_start ||
+ rmap_item->address >= vma->vm_end)
+ continue;
+ /*
+ * Initially we examine only the vma which covers this
+ * rmap_item; but later, if there is still work to do,
+ * we examine covering vmas in other mms: in case they
+ * were forked from the original since ksmd passed.
+ */
+ if ((rmap_item->mm == vma->vm_mm) == search_new_forks)
+ continue;
+
+ ret = rmap_one(page, vma, rmap_item->address, arg);
+ if (ret != SWAP_AGAIN) {
+ spin_unlock(&anon_vma->lock);
+ goto out;
+ }
+ }
+ spin_unlock(&anon_vma->lock);
+ }
+ if (!search_new_forks++)
+ goto again;
+out:
+ return ret;
+}
+
+void ksm_migrate_page(struct page *newpage, struct page *oldpage)
+{
+ struct stable_node *stable_node;
+
+ VM_BUG_ON(!PageLocked(oldpage));
+ VM_BUG_ON(!PageLocked(newpage));
+ VM_BUG_ON(newpage->mapping != oldpage->mapping);
+
+ stable_node = page_stable_node(newpage);
+ if (stable_node) {
+ VM_BUG_ON(stable_node->page != oldpage);
+ stable_node->page = newpage;
+ }
+}
+#endif /* CONFIG_MIGRATION */
+
#ifdef CONFIG_SYSFS
/*
* This all compiles without CONFIG_SYSFS, but is a waste of space.
--- ksm6/mm/migrate.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm7/mm/migrate.c 2009-11-22 20:40:46.000000000 +0000
@@ -21,6 +21,7 @@
#include <linux/mm_inline.h>
#include <linux/nsproxy.h>
#include <linux/pagevec.h>
+#include <linux/ksm.h>
#include <linux/rmap.h>
#include <linux/topology.h>
#include <linux/cpu.h>
@@ -78,8 +79,8 @@ int putback_lru_pages(struct list_head *
/*
* Restore a potential migration pte to a working pte entry
*/
-static void remove_migration_pte(struct vm_area_struct *vma,
- struct page *old, struct page *new)
+static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+ unsigned long addr, void *old)
{
struct mm_struct *mm = vma->vm_mm;
swp_entry_t entry;
@@ -88,40 +89,37 @@ static void remove_migration_pte(struct
pmd_t *pmd;
pte_t *ptep, pte;
spinlock_t *ptl;
- unsigned long addr = page_address_in_vma(new, vma);
-
- if (addr == -EFAULT)
- return;
pgd = pgd_offset(mm, addr);
if (!pgd_present(*pgd))
- return;
+ goto out;
pud = pud_offset(pgd, addr);
if (!pud_present(*pud))
- return;
+ goto out;
pmd = pmd_offset(pud, addr);
if (!pmd_present(*pmd))
- return;
+ goto out;
ptep = pte_offset_map(pmd, addr);
if (!is_swap_pte(*ptep)) {
pte_unmap(ptep);
- return;
+ goto out;
}
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
pte = *ptep;
if (!is_swap_pte(pte))
- goto out;
+ goto unlock;
entry = pte_to_swp_entry(pte);
- if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
- goto out;
+ if (!is_migration_entry(entry) ||
+ migration_entry_to_page(entry) != old)
+ goto unlock;
get_page(new);
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
@@ -137,55 +135,10 @@ static void remove_migration_pte(struct
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, addr, pte);
-
-out:
+unlock:
pte_unmap_unlock(ptep, ptl);
-}
-
-/*
- * Note that remove_file_migration_ptes will only work on regular mappings,
- * Nonlinear mappings do not use migration entries.
- */
-static void remove_file_migration_ptes(struct page *old, struct page *new)
-{
- struct vm_area_struct *vma;
- struct address_space *mapping = new->mapping;
- struct prio_tree_iter iter;
- pgoff_t pgoff = new->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-
- if (!mapping)
- return;
-
- spin_lock(&mapping->i_mmap_lock);
-
- vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
- remove_migration_pte(vma, old, new);
-
- spin_unlock(&mapping->i_mmap_lock);
-}
-
-/*
- * Must hold mmap_sem lock on at least one of the vmas containing
- * the page so that the anon_vma cannot vanish.
- */
-static void remove_anon_migration_ptes(struct page *old, struct page *new)
-{
- struct anon_vma *anon_vma;
- struct vm_area_struct *vma;
-
- /*
- * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
- */
- anon_vma = page_anon_vma(new);
- if (!anon_vma)
- return;
-
- spin_lock(&anon_vma->lock);
-
- list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
- remove_migration_pte(vma, old, new);
-
- spin_unlock(&anon_vma->lock);
+out:
+ return SWAP_AGAIN;
}
/*
@@ -194,10 +147,7 @@ static void remove_anon_migration_ptes(s
*/
static void remove_migration_ptes(struct page *old, struct page *new)
{
- if (PageAnon(new))
- remove_anon_migration_ptes(old, new);
- else
- remove_file_migration_ptes(old, new);
+ rmap_walk(new, remove_migration_pte, old);
}
/*
@@ -358,6 +308,7 @@ static void migrate_page_copy(struct pag
}
mlock_migrate_page(newpage, page);
+ ksm_migrate_page(newpage, page);
ClearPageSwapCache(page);
ClearPagePrivate(page);
@@ -577,9 +528,9 @@ static int move_to_new_page(struct page
else
rc = fallback_migrate_page(mapping, newpage, page);
- if (!rc) {
+ if (!rc)
remove_migration_ptes(page, newpage);
- } else
+ else
newpage->mapping = NULL;
unlock_page(newpage);
--- ksm6/mm/rmap.c 2009-11-22 20:40:27.000000000 +0000
+++ ksm7/mm/rmap.c 2009-11-22 20:40:46.000000000 +0000
@@ -1199,3 +1199,82 @@ int try_to_munlock(struct page *page)
else
return try_to_unmap_file(page, TTU_MUNLOCK);
}
+
+#ifdef CONFIG_MIGRATION
+/*
+ * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
+ * Called by migrate.c to remove migration ptes, but might be used more later.
+ */
+static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
+ struct vm_area_struct *, unsigned long, void *), void *arg)
+{
+ struct anon_vma *anon_vma;
+ struct vm_area_struct *vma;
+ int ret = SWAP_AGAIN;
+
+ /*
+ * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
+ * because that depends on page_mapped(); but not all its usages
+ * are holding mmap_sem, which also gave the necessary guarantee
+ * (that this anon_vma's slab has not already been destroyed).
+ * This needs to be reviewed later: avoiding page_lock_anon_vma()
+ * is risky, and currently limits the usefulness of rmap_walk().
+ */
+ anon_vma = page_anon_vma(page);
+ if (!anon_vma)
+ return ret;
+ spin_lock(&anon_vma->lock);
+ list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ unsigned long address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
+ ret = rmap_one(page, vma, address, arg);
+ if (ret != SWAP_AGAIN)
+ break;
+ }
+ spin_unlock(&anon_vma->lock);
+ return ret;
+}
+
+static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *,
+ struct vm_area_struct *, unsigned long, void *), void *arg)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ int ret = SWAP_AGAIN;
+
+ if (!mapping)
+ return ret;
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
+ ret = rmap_one(page, vma, address, arg);
+ if (ret != SWAP_AGAIN)
+ break;
+ }
+ /*
+ * No nonlinear handling: being always shared, nonlinear vmas
+ * never contain migration ptes. Decide what to do about this
+ * limitation to linear when we need rmap_walk() on nonlinear.
+ */
+ spin_unlock(&mapping->i_mmap_lock);
+ return ret;
+}
+
+int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
+ struct vm_area_struct *, unsigned long, void *), void *arg)
+{
+ VM_BUG_ON(!PageLocked(page));
+
+ if (unlikely(PageKsm(page)))
+ return rmap_walk_ksm(page, rmap_one, arg);
+ else if (PageAnon(page))
+ return rmap_walk_anon(page, rmap_one, arg);
+ else
+ return rmap_walk_file(page, rmap_one, arg);
+}
+#endif /* CONFIG_MIGRATION */
The previous patch enables page migration of ksm pages, but that soon
gets into trouble: not surprising, since we're using the ksm page lock
to lock operations on its stable_node, but page migration switches the
page whose lock is to be used for that. Another layer of locking would
fix it, but do we need that yet?
Do we actually need page migration of ksm pages? Yes, memory hotremove
needs to offline sections of memory: and since we stopped allocating ksm
pages with GFP_HIGHUSER (an earlier patch in this series reuses the
original anon page instead), ksm pages are now ordinary GFP_HIGHUSER_MOVABLE
allocations, which hotremove expects to be able to migrate.
But KSM is currently unconscious of NUMA issues, happily merging pages
from different NUMA nodes: at present the rule must be, not to use
MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration
of ksm pages does not make sense yet.
So, to complete support for ksm swapping we need to make hotremove safe.
ksm_memory_callback() takes ksm_thread_mutex on MEM_GOING_OFFLINE and
releases it on MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
are freed before migration reaches them, stable_nodes may be left still
pointing to struct pages which have been removed from the system: so the
stable_node now identifies its page by pfn rather than by page pointer,
and ksm can then safely prune stale stable_nodes on MEM_OFFLINE.
And make NUMA migration skip PageKsm pages where it skips PageReserved.
But it's only when we reach unmap_and_move() that the page lock is taken
and we can be sure that the raised pagecount has prevented a PageAnon page
from being upgraded to PageKsm beneath us: so add an offlining arg to
migrate_pages(), to migrate a ksm page when offlining (which has sufficient
locking), but reject it otherwise.
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/migrate.h | 8 +--
mm/ksm.c | 84 ++++++++++++++++++++++++++++++++------
mm/memory_hotplug.c | 2
mm/mempolicy.c | 19 +++-----
mm/migrate.c | 27 +++++++++---
5 files changed, 103 insertions(+), 37 deletions(-)
--- ksm7/include/linux/migrate.h 2009-03-23 23:12:14.000000000 +0000
+++ ksm8/include/linux/migrate.h 2009-11-22 20:40:53.000000000 +0000
@@ -12,7 +12,8 @@ typedef struct page *new_page_t(struct p
extern int putback_lru_pages(struct list_head *l);
extern int migrate_page(struct address_space *,
struct page *, struct page *);
-extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long);
+extern int migrate_pages(struct list_head *l, new_page_t x,
+ unsigned long private, int offlining);
extern int fail_migrate_page(struct address_space *,
struct page *, struct page *);
@@ -26,10 +27,7 @@ extern int migrate_vmas(struct mm_struct
static inline int putback_lru_pages(struct list_head *l) { return 0; }
static inline int migrate_pages(struct list_head *l, new_page_t x,
- unsigned long private) { return -ENOSYS; }
-
-static inline int migrate_pages_to(struct list_head *pagelist,
- struct vm_area_struct *vma, int dest) { return 0; }
+ unsigned long private, int offlining) { return -ENOSYS; }
static inline int migrate_prep(void) { return -ENOSYS; }
--- ksm7/mm/ksm.c 2009-11-22 20:40:46.000000000 +0000
+++ ksm8/mm/ksm.c 2009-11-22 20:40:53.000000000 +0000
@@ -29,6 +29,7 @@
#include <linux/wait.h>
#include <linux/slab.h>
#include <linux/rbtree.h>
+#include <linux/memory.h>
#include <linux/mmu_notifier.h>
#include <linux/swap.h>
#include <linux/ksm.h>
@@ -108,14 +109,14 @@ struct ksm_scan {
/**
* struct stable_node - node of the stable rbtree
- * @page: pointer to struct page of the ksm page
* @node: rb node of this ksm page in the stable tree
* @hlist: hlist head of rmap_items using this ksm page
+ * @kpfn: page frame number of this ksm page
*/
struct stable_node {
- struct page *page;
struct rb_node node;
struct hlist_head hlist;
+ unsigned long kpfn;
};
/**
@@ -515,7 +516,7 @@ static struct page *get_ksm_page(struct
struct page *page;
void *expected_mapping;
- page = stable_node->page;
+ page = pfn_to_page(stable_node->kpfn);
expected_mapping = (void *)stable_node +
(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
rcu_read_lock();
@@ -973,7 +974,7 @@ static struct page *try_to_merge_two_pag
* This function returns the stable tree node of identical content if found,
* NULL otherwise.
*/
-static struct stable_node *stable_tree_search(struct page *page)
+static struct page *stable_tree_search(struct page *page)
{
struct rb_node *node = root_stable_tree.rb_node;
struct stable_node *stable_node;
@@ -981,7 +982,7 @@ static struct stable_node *stable_tree_s
stable_node = page_stable_node(page);
if (stable_node) { /* ksm page forked */
get_page(page);
- return stable_node;
+ return page;
}
while (node) {
@@ -1003,7 +1004,7 @@ static struct stable_node *stable_tree_s
put_page(tree_page);
node = node->rb_right;
} else
- return stable_node;
+ return tree_page;
}
return NULL;
@@ -1059,7 +1060,7 @@ static struct stable_node *stable_tree_i
INIT_HLIST_HEAD(&stable_node->hlist);
- stable_node->page = kpage;
+ stable_node->kpfn = page_to_pfn(kpage);
set_page_stable_node(kpage, stable_node);
return stable_node;
@@ -1170,9 +1171,8 @@ static void cmp_and_merge_page(struct pa
remove_rmap_item_from_tree(rmap_item);
/* We first start with searching the page inside the stable tree */
- stable_node = stable_tree_search(page);
- if (stable_node) {
- kpage = stable_node->page;
+ kpage = stable_tree_search(page);
+ if (kpage) {
err = try_to_merge_with_ksm_page(rmap_item, page, kpage);
if (!err) {
/*
@@ -1180,7 +1180,7 @@ static void cmp_and_merge_page(struct pa
* add its rmap_item to the stable tree.
*/
lock_page(kpage);
- stable_tree_append(rmap_item, stable_node);
+ stable_tree_append(rmap_item, page_stable_node(kpage));
unlock_page(kpage);
}
put_page(kpage);
@@ -1715,12 +1715,63 @@ void ksm_migrate_page(struct page *newpa
stable_node = page_stable_node(newpage);
if (stable_node) {
- VM_BUG_ON(stable_node->page != oldpage);
- stable_node->page = newpage;
+ VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
+ stable_node->kpfn = page_to_pfn(newpage);
}
}
#endif /* CONFIG_MIGRATION */
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static struct stable_node *ksm_check_stable_tree(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ struct rb_node *node;
+
+ for (node = rb_first(&root_stable_tree); node; node = rb_next(node)) {
+ struct stable_node *stable_node;
+
+ stable_node = rb_entry(node, struct stable_node, node);
+ if (stable_node->kpfn >= start_pfn &&
+ stable_node->kpfn < end_pfn)
+ return stable_node;
+ }
+ return NULL;
+}
+
+static int ksm_memory_callback(struct notifier_block *self,
+ unsigned long action, void *arg)
+{
+ struct memory_notify *mn = arg;
+ struct stable_node *stable_node;
+
+ switch (action) {
+ case MEM_GOING_OFFLINE:
+ /*
+ * Keep it very simple for now: just lock out ksmd and
+ * MADV_UNMERGEABLE while any memory is going offline.
+ */
+ mutex_lock(&ksm_thread_mutex);
+ break;
+
+ case MEM_OFFLINE:
+ /*
+ * Most of the work is done by page migration; but there might
+ * be a few stable_nodes left over, still pointing to struct
+ * pages which have been offlined: prune those from the tree.
+ */
+ while ((stable_node = ksm_check_stable_tree(mn->start_pfn,
+ mn->start_pfn + mn->nr_pages)) != NULL)
+ remove_node_from_stable_tree(stable_node);
+ /* fallthrough */
+
+ case MEM_CANCEL_OFFLINE:
+ mutex_unlock(&ksm_thread_mutex);
+ break;
+ }
+ return NOTIFY_OK;
+}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
#ifdef CONFIG_SYSFS
/*
* This all compiles without CONFIG_SYSFS, but is a waste of space.
@@ -1946,6 +1997,13 @@ static int __init ksm_init(void)
#endif /* CONFIG_SYSFS */
+#ifdef CONFIG_MEMORY_HOTREMOVE
+ /*
+ * Choose a high priority since the callback takes ksm_thread_mutex:
+ * later callbacks could only be taking locks which nest within that.
+ */
+ hotplug_memory_notifier(ksm_memory_callback, 100);
+#endif
return 0;
out_free2:
--- ksm7/mm/memory_hotplug.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm8/mm/memory_hotplug.c 2009-11-22 20:40:53.000000000 +0000
@@ -698,7 +698,7 @@ do_migrate_range(unsigned long start_pfn
if (list_empty(&source))
goto out;
/* this function returns # of failed pages */
- ret = migrate_pages(&source, hotremove_migrate_alloc, 0);
+ ret = migrate_pages(&source, hotremove_migrate_alloc, 0, 1);
out:
return ret;
--- ksm7/mm/mempolicy.c 2009-11-14 10:17:02.000000000 +0000
+++ ksm8/mm/mempolicy.c 2009-11-22 20:40:53.000000000 +0000
@@ -85,6 +85,7 @@
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
#include <linux/migrate.h>
+#include <linux/ksm.h>
#include <linux/rmap.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -413,17 +414,11 @@ static int check_pte_range(struct vm_are
if (!page)
continue;
/*
- * The check for PageReserved here is important to avoid
- * handling zero pages and other pages that may have been
- * marked special by the system.
- *
- * If the PageReserved would not be checked here then f.e.
- * the location of the zero page could have an influence
- * on MPOL_MF_STRICT, zero pages would be counted for
- * the per node stats, and there would be useless attempts
- * to put zero pages on the migration list.
+ * vm_normal_page() filters out zero pages, but there might
+ * still be PageReserved pages to skip, perhaps in a VDSO.
+ * And we cannot move PageKsm pages sensibly or safely yet.
*/
- if (PageReserved(page))
+ if (PageReserved(page) || PageKsm(page))
continue;
nid = page_to_nid(page);
if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
@@ -839,7 +834,7 @@ static int migrate_to_node(struct mm_str
flags | MPOL_MF_DISCONTIG_OK, &pagelist);
if (!list_empty(&pagelist))
- err = migrate_pages(&pagelist, new_node_page, dest);
+ err = migrate_pages(&pagelist, new_node_page, dest, 0);
return err;
}
@@ -1056,7 +1051,7 @@ static long do_mbind(unsigned long start
if (!list_empty(&pagelist))
nr_failed = migrate_pages(&pagelist, new_vma_page,
- (unsigned long)vma);
+ (unsigned long)vma, 0);
if (!err && nr_failed && (flags & MPOL_MF_STRICT))
err = -EIO;
--- ksm7/mm/migrate.c 2009-11-22 20:40:46.000000000 +0000
+++ ksm8/mm/migrate.c 2009-11-22 20:40:53.000000000 +0000
@@ -543,7 +543,7 @@ static int move_to_new_page(struct page
* to the newly allocated page in newpage.
*/
static int unmap_and_move(new_page_t get_new_page, unsigned long private,
- struct page *page, int force)
+ struct page *page, int force, int offlining)
{
int rc = 0;
int *result = NULL;
@@ -569,6 +569,20 @@ static int unmap_and_move(new_page_t get
lock_page(page);
}
+ /*
+ * Only memory hotplug's offline_pages() caller has locked out KSM,
+ * and can safely migrate a KSM page. The other cases have skipped
+ * PageKsm along with PageReserved - but it is only now when we have
+ * the page lock that we can be certain it will not go KSM beneath us
+ * (KSM will not upgrade a page from PageAnon to PageKsm when it sees
+ * its pagecount raised, but only here do we take the page lock which
+ * serializes that).
+ */
+ if (PageKsm(page) && !offlining) {
+ rc = -EBUSY;
+ goto unlock;
+ }
+
/* charge against new page */
charge = mem_cgroup_prepare_migration(page, &mem);
if (charge == -ENOMEM) {
@@ -685,7 +699,7 @@ move_newpage:
* Return: Number of pages not migrated or error code.
*/
int migrate_pages(struct list_head *from,
- new_page_t get_new_page, unsigned long private)
+ new_page_t get_new_page, unsigned long private, int offlining)
{
int retry = 1;
int nr_failed = 0;
@@ -705,7 +719,7 @@ int migrate_pages(struct list_head *from
cond_resched();
rc = unmap_and_move(get_new_page, private,
- page, pass > 2);
+ page, pass > 2, offlining);
switch(rc) {
case -ENOMEM:
@@ -801,7 +815,8 @@ static int do_move_page_to_node_array(st
if (!page)
goto set_status;
- if (PageReserved(page)) /* Check for zero page */
+ /* Use PageReserved to check for zero page */
+ if (PageReserved(page) || PageKsm(page))
goto put_and_set;
pp->page = page;
@@ -838,7 +853,7 @@ set_status:
err = 0;
if (!list_empty(&pagelist))
err = migrate_pages(&pagelist, new_page_node,
- (unsigned long)pm);
+ (unsigned long)pm, 0);
up_read(&mm->mmap_sem);
return err;
@@ -959,7 +974,7 @@ static void do_pages_stat_array(struct m
err = -ENOENT;
/* Use PageReserved to check for zero page */
- if (!page || PageReserved(page))
+ if (!page || PageReserved(page) || PageKsm(page))
goto set_status;
err = page_to_nid(page);
Now that ksm pages are swappable, and the known holes plugged, remove
mention of unswappable kernel pages from KSM documentation and comments.
Remove the totalram_pages/4 initialization of max_kernel_pages. In fact,
remove max_kernel_pages altogether - we can reinstate it if removal turns
out to break someone's script; but if we later want to limit KSM's memory
usage, limiting the stable nodes would not be an effective approach.
Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/vm/ksm.txt | 22 ++++++-------------
mm/Kconfig | 2 -
mm/ksm.c | 41 +------------------------------------
3 files changed, 10 insertions(+), 55 deletions(-)
--- ksm8/Documentation/vm/ksm.txt 2009-10-12 00:26:36.000000000 +0100
+++ ksm9/Documentation/vm/ksm.txt 2009-11-22 20:41:00.000000000 +0000
@@ -16,9 +16,9 @@ by sharing the data common between them.
application which generates many instances of the same data.
KSM only merges anonymous (private) pages, never pagecache (file) pages.
-KSM's merged pages are at present locked into kernel memory for as long
-as they are shared: so cannot be swapped out like the user pages they
-replace (but swapping KSM pages should follow soon in a later release).
+KSM's merged pages were originally locked into kernel memory, but can now
+be swapped out just like other user pages (but sharing is broken when they
+are swapped back in: ksmd must rediscover their identity and merge again).
KSM only operates on those areas of address space which an application
has advised to be likely candidates for merging, by using the madvise(2)
@@ -44,20 +44,12 @@ includes unmapped gaps (though working o
and might fail with EAGAIN if not enough memory for internal structures.
Applications should be considerate in their use of MADV_MERGEABLE,
-restricting its use to areas likely to benefit. KSM's scans may use
-a lot of processing power, and its kernel-resident pages are a limited
-resource. Some installations will disable KSM for these reasons.
+restricting its use to areas likely to benefit. KSM's scans may use a lot
+of processing power: some installations will disable KSM for that reason.
The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/,
readable by all but writable only by root:
-max_kernel_pages - set to maximum number of kernel pages that KSM may use
- e.g. "echo 100000 > /sys/kernel/mm/ksm/max_kernel_pages"
- Value 0 imposes no limit on the kernel pages KSM may use;
- but note that any process using MADV_MERGEABLE can cause
- KSM to allocate these pages, unswappable until it exits.
- Default: quarter of memory (chosen to not pin too much)
-
pages_to_scan - how many present pages to scan before ksmd goes to sleep
e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan"
Default: 100 (chosen for demonstration purposes)
@@ -75,7 +67,7 @@ run - set 0 to stop ksmd fr
The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
-pages_shared - how many shared unswappable kernel pages KSM is using
+pages_shared - how many shared pages are being used
pages_sharing - how many more sites are sharing them i.e. how much saved
pages_unshared - how many pages unique but repeatedly checked for merging
pages_volatile - how many pages changing too fast to be placed in a tree
@@ -87,4 +79,4 @@ pages_volatile embraces several differen
proportion there would also indicate poor use of madvise MADV_MERGEABLE.
Izik Eidus,
-Hugh Dickins, 24 Sept 2009
+Hugh Dickins, 17 Nov 2009
--- ksm8/mm/Kconfig 2009-11-14 10:17:02.000000000 +0000
+++ ksm9/mm/Kconfig 2009-11-22 20:41:00.000000000 +0000
@@ -212,7 +212,7 @@ config KSM
Enable Kernel Samepage Merging: KSM periodically scans those areas
of an application's address space that an app has advised may be
mergeable. When it finds pages of identical content, it replaces
- the many instances by a single resident page with that content, so
+ the many instances by a single page with that content, so
saving memory until one or another app needs to modify the content.
Recommended for use with KVM, or with other duplicative applications.
See Documentation/vm/ksm.txt for more information: KSM is inactive
--- ksm8/mm/ksm.c 2009-11-22 20:40:53.000000000 +0000
+++ ksm9/mm/ksm.c 2009-11-22 20:41:00.000000000 +0000
@@ -179,9 +179,6 @@ static unsigned long ksm_pages_unshared;
/* The number of rmap_items in use: to calculate pages_volatile */
static unsigned long ksm_rmap_items;
-/* Limit on the number of unswappable pages used */
-static unsigned long ksm_max_kernel_pages;
-
/* Number of pages ksmd should scan in one batch */
static unsigned int ksm_thread_pages_to_scan = 100;
@@ -943,14 +940,6 @@ static struct page *try_to_merge_two_pag
{
int err;
- /*
- * The number of nodes in the stable tree
- * is the number of kernel pages that we hold.
- */
- if (ksm_max_kernel_pages &&
- ksm_max_kernel_pages <= ksm_pages_shared)
- return NULL;
-
err = try_to_merge_with_ksm_page(rmap_item, page, NULL);
if (!err) {
err = try_to_merge_with_ksm_page(tree_rmap_item,
@@ -1850,8 +1839,8 @@ static ssize_t run_store(struct kobject
/*
* KSM_RUN_MERGE sets ksmd running, and 0 stops it running.
* KSM_RUN_UNMERGE stops it running and unmerges all rmap_items,
- * breaking COW to free the unswappable pages_shared (but leaves
- * mm_slots on the list for when ksmd may be set running again).
+ * breaking COW to free the pages_shared (but leaves mm_slots
+ * on the list for when ksmd may be set running again).
*/
mutex_lock(&ksm_thread_mutex);
@@ -1876,29 +1865,6 @@ static ssize_t run_store(struct kobject
}
KSM_ATTR(run);
-static ssize_t max_kernel_pages_store(struct kobject *kobj,
- struct kobj_attribute *attr,
- const char *buf, size_t count)
-{
- int err;
- unsigned long nr_pages;
-
- err = strict_strtoul(buf, 10, &nr_pages);
- if (err)
- return -EINVAL;
-
- ksm_max_kernel_pages = nr_pages;
-
- return count;
-}
-
-static ssize_t max_kernel_pages_show(struct kobject *kobj,
- struct kobj_attribute *attr, char *buf)
-{
- return sprintf(buf, "%lu\n", ksm_max_kernel_pages);
-}
-KSM_ATTR(max_kernel_pages);
-
static ssize_t pages_shared_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -1948,7 +1914,6 @@ static struct attribute *ksm_attrs[] = {
&sleep_millisecs_attr.attr,
&pages_to_scan_attr.attr,
&run_attr.attr,
- &max_kernel_pages_attr.attr,
&pages_shared_attr.attr,
&pages_sharing_attr.attr,
&pages_unshared_attr.attr,
@@ -1968,8 +1933,6 @@ static int __init ksm_init(void)
struct task_struct *ksm_thread;
int err;
- ksm_max_kernel_pages = totalram_pages / 4;
-
err = ksm_slab_init();
if (err)
goto out;
On 11/24/2009 11:40 AM, Hugh Dickins wrote:
> When KSM merges an mlocked page, it has been forgetting to munlock it:
> that's been left to free_page_mlock(), which reports it in /proc/vmstat
> as unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
> whinges "Page flag mlocked set for process" in mmotm, whereas mainline
> is silently forgiving). Call munlock_vma_page() to fix that.
>
> Signed-off-by: Hugh Dickins<[email protected]>
>
>
Acked-by: Rik van Riel <[email protected]>
* Hugh Dickins <[email protected]> [2009-11-24 16:51:13]:
> But ksm swapping does require one small change in mem cgroup handling.
> When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
> substitute a duplicate page to accommodate a different anon_vma (or a
> different index), that page escaped mem cgroup accounting, because of
> the !PageSwapCache check in mem_cgroup_try_charge_swapin().
>
The duplicate page doesn't show up as PageSwapCache or are we optimizing
for the race condition where the page is not in SwapCache? I should
probably look at the full series.
> That was returning success without charging, on the assumption that
> pte_same() would fail after, which is not the case here. Originally I
> proposed that success, so that an unshrinkable mem cgroup at its limit
> would not fail unnecessarily; but that's a minor point, and there are
> plenty of other places where we may fail an overallocation which might
> later prove unnecessary. So just go ahead and do what all the other
> exceptions do: proceed to charge current mm.
>
> Signed-off-by: Hugh Dickins <[email protected]>
Thanks for the patch!
Acked-by: Balbir Singh <[email protected]>
--
Balbir
On Wed, 25 Nov 2009, Balbir Singh wrote:
> * Hugh Dickins <[email protected]> [2009-11-24 16:51:13]:
>
> > But ksm swapping does require one small change in mem cgroup handling.
> > When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
> > substitute a duplicate page to accommodate a different anon_vma (or a
> > different index), that page escaped mem cgroup accounting, because of
> > the !PageSwapCache check in mem_cgroup_try_charge_swapin().
> >
>
> The duplicate page doesn't show up as PageSwapCache
That's right.
> or are we optimizing
> for the race condition where the page is not in SwapCache?
No, optimization wasn't on my mind at all. To be honest, it's slightly
worsening the case of the race in which another thread has independently
faulted it in, and then removed it from swap cache. But I think we'll
agree that that's rare enough a case that a few more cycles doing it
won't matter.
> I should probably look at the full series.
2/9 is the one which brings the problem: it's ksm_might_need_to_copy()
(an inline which tests for the condition) and ksm_does_need_to_copy()
(which makes a duplicate page when that condition holds).
The problem arises because an Anon struct page contains a pointer to
its anon_vma, used to locate its ptes when swapping. Suddenly, with
KSM swapping, an anon page may get read in from swap, faulted in and
pointed to its anon_vma, everything fine; but then faulted in again
somewhere else, and needs to be pointed to a different anon_vma...
Lose its anon_vma and it becomes unswappable, not a good choice when
trying to extend swappability: so instead we allocate a duplicate page
just to point to the different anon_vma; and if they last long enough,
unchanged, KSM will come around again to find them the same and
remerge them. Not an efficient solution, but a simple solution,
much in keeping with the way KSM already works.
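Just to show the shape of that decision, the test in 2/9 is essentially
the following (a from-memory sketch, not a verbatim quote of the patch;
page_anon_vma() and linear_page_index() are the existing helpers):
static inline int ksm_might_need_to_copy(struct page *page,
			struct vm_area_struct *vma, unsigned long address)
{
	struct anon_vma *anon_vma = page_anon_vma(page);

	/*
	 * Copy only if the page already has an anon_vma, and either that
	 * anon_vma or the linear index no longer matches this vma/address.
	 */
	return anon_vma &&
		(anon_vma != vma->anon_vma ||
		 page->index != linear_page_index(vma, address));
}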
The duplicate page is not PageSwapCache: certainly it crossed my mind
to try making it PageSwapCache like the original, but I think that
raises lots of other problems (how do we make the radix_tree slot
for that offset hold two page pointers?).
Hugh
* Hugh Dickins <[email protected]> [2009-11-25 17:12:13]:
> On Wed, 25 Nov 2009, Balbir Singh wrote:
> > * Hugh Dickins <[email protected]> [2009-11-24 16:51:13]:
> >
> > > But ksm swapping does require one small change in mem cgroup handling.
> > > When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
> > > substitute a duplicate page to accommodate a different anon_vma (or a
> > > different index), that page escaped mem cgroup accounting, because of
> > > the !PageSwapCache check in mem_cgroup_try_charge_swapin().
> > >
> >
> > The duplicate page doesn't show up as PageSwapCache
>
> That's right.
>
> > or are we optimizing
> > for the race condition where the page is not in SwapCache?
>
> No, optimization wasn't on my mind at all. To be honest, it's slightly
> worsening the case of the race in which another thread has independently
> faulted it in, and then removed it from swap cache. But I think we'll
> agree that that's rare enough a case that a few more cycles doing it
> won't matter.
>
Thanks for clarifying; yes, I agree that the condition is rare and
nothing for us to worry about at the moment.
> > I should probably look at the full series.
>
> 2/9 is the one which brings the problem: it's ksm_might_need_to_copy()
> (an inline which tests for the condition) and ksm_does_need_to_copy()
> (which makes a duplicate page when that condition holds).
>
> The problem arises because an Anon struct page contains a pointer to
> its anon_vma, used to locate its ptes when swapping. Suddenly, with
> KSM swapping, an anon page may get read in from swap, faulted in and
> pointed to its anon_vma, everything fine; but then faulted in again
> somewhere else, and needs to be pointed to a different anon_vma...
>
> Lose its anon_vma and it becomes unswappable, not a good choice when
> trying to extend swappability: so instead we allocate a duplicate page
> just to point to the different anon_vma; and if they last long enough,
> unchanged, KSM will come around again to find them the same and
> remerge them. Not an efficient solution, but a simple solution,
> much in keeping with the way KSM already works.
>
> The duplicate page is not PageSwapCache: certainly it crossed my mind
> to try making it PageSwapCache like the original, but I think that
> raises lots of other problems (how do we make the radix_tree slot
> for that offset hold two page pointers?).
>
Thanks for the detailed explanation, it does help me understand what
is going on.
--
Balbir
On Tue, Nov 24, 2009 at 04:40:55PM +0000, Hugh Dickins wrote:
> When KSM merges an mlocked page, it has been forgetting to munlock it:
> that's been left to free_page_mlock(), which reports it in /proc/vmstat
> as unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
> whinges "Page flag mlocked set for process" in mmotm, whereas mainline
> is silently forgiving). Call munlock_vma_page() to fix that.
>
> Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Mel Gorman <[email protected]>
> ---
> Is this a fix that I ought to backport to 2.6.32? It does rely on part of
> an earlier patch (moved unlock_page down), so does not apply cleanly as is.
>
> mm/internal.h | 3 ++-
> mm/ksm.c | 4 ++++
> mm/mlock.c | 4 ++--
> 3 files changed, 8 insertions(+), 3 deletions(-)
>
> --- ksm0/mm/internal.h 2009-11-14 10:17:02.000000000 +0000
> +++ ksm1/mm/internal.h 2009-11-22 20:39:56.000000000 +0000
> @@ -105,9 +105,10 @@ static inline int is_mlocked_vma(struct
> }
>
> /*
> - * must be called with vma's mmap_sem held for read, and page locked.
> + * must be called with vma's mmap_sem held for read or write, and page locked.
> */
> extern void mlock_vma_page(struct page *page);
> +extern void munlock_vma_page(struct page *page);
>
> /*
> * Clear the page's PageMlocked(). This can be useful in a situation where
> --- ksm0/mm/ksm.c 2009-11-14 10:17:02.000000000 +0000
> +++ ksm1/mm/ksm.c 2009-11-22 20:39:56.000000000 +0000
> @@ -34,6 +34,7 @@
> #include <linux/ksm.h>
>
> #include <asm/tlbflush.h>
> +#include "internal.h"
>
> /*
> * A few notes about the KSM scanning process,
> @@ -762,6 +763,9 @@ static int try_to_merge_one_page(struct
> pages_identical(page, kpage))
> err = replace_page(vma, page, kpage, orig_pte);
>
> + if ((vma->vm_flags & VM_LOCKED) && !err)
> + munlock_vma_page(page);
> +
> unlock_page(page);
> out:
> return err;
> --- ksm0/mm/mlock.c 2009-11-14 10:17:02.000000000 +0000
> +++ ksm1/mm/mlock.c 2009-11-22 20:39:56.000000000 +0000
> @@ -99,14 +99,14 @@ void mlock_vma_page(struct page *page)
> * not get another chance to clear PageMlocked. If we successfully
> * isolate the page and try_to_munlock() detects other VM_LOCKED vmas
> * mapping the page, it will restore the PageMlocked state, unless the page
> - * is mapped in a non-linear vma. So, we go ahead and SetPageMlocked(),
> + * is mapped in a non-linear vma. So, we go ahead and ClearPageMlocked(),
> * perhaps redundantly.
> * If we lose the isolation race, and the page is mapped by other VM_LOCKED
> * vmas, we'll detect this in vmscan--via try_to_munlock() or try_to_unmap()
> * either of which will restore the PageMlocked state by calling
> * mlock_vma_page() above, if it can grab the vma's mmap sem.
> */
> -static void munlock_vma_page(struct page *page)
> +void munlock_vma_page(struct page *page)
> {
> BUG_ON(!PageLocked(page));
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, 26 Nov 2009, Mel Gorman wrote:
> On Tue, Nov 24, 2009 at 04:40:55PM +0000, Hugh Dickins wrote:
> > When KSM merges an mlocked page, it has been forgetting to munlock it:
> > that's been left to free_page_mlock(), which reports it in /proc/vmstat
> > as unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
> > whinges "Page flag mlocked set for process" in mmotm, whereas mainline
> > is silently forgiving). Call munlock_vma_page() to fix that.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> Acked-by: Mel Gorman <[email protected]>
Rik & Mel, thanks for the Acks.
But please clarify: that patch was for mmotm and hopefully 2.6.33,
but the vmstat issue (minus warning message) is there in 2.6.32-rc.
Should I
(a) forget it for 2.6.32
(b) rush Linus a patch for 2.6.32 final
(c) send a patch for 2.6.32.stable later on
? I just don't have a feel for how important this is.
Typically, these pages are immediately freed, and the only issue is
which stats they get added to; but if fork has copied them into other
mms, then such pages might stay unevictable indefinitely, despite no
longer being in any mlocked vma.
There's a remark in munlock_vma_page(), apropos a different issue,
/*
* We lost the race. let try_to_unmap() deal
* with it. At least we get the page state and
* mlock stats right. However, page is still on
* the noreclaim list. We'll fix that up when
* the page is eventually freed or we scan the
* noreclaim list.
*/
which implies that sometimes we scan the unevictable list and resolve
such cases. But I wonder if that's nowadays the case?
>
> > ---
> > Is this a fix that I ought to backport to 2.6.32? It does rely on part of
> > an earlier patch (moved unlock_page down), so does not apply cleanly as is.
Thanks,
Hugh
Sorry for delayed response.
On Tue, 24 Nov 2009 16:48:46 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:
> When ksm pages were unswappable, it made no sense to include them in
> mem cgroup accounting; but now that they are swappable (although I see
> no strict logical connection)
I asked that for throwing away too complicated but wast of time things.
If not on LRU, its own limitation (ksm's page limit) works enough.
> the principle of least surprise implies
> that they should be accounted (with the usual dissatisfaction, that a
> shared page is accounted to only one of the cgroups using it).
>
> This patch was intended to add mem cgroup accounting where necessary;
> but turned inside out, it now avoids allocating a ksm page, instead
> upgrading an anon page to ksm - which brings its existing mem cgroup
> accounting with it. Thus mem cgroups don't appear in the patch at all.
>
ok. then, what I should see is patch 6.
Thanks,
-Kame
> This upgrade from PageAnon to PageKsm takes place under page lock
> (via a somewhat hacky NULL kpage interface), and audit showed only
> one place which needed to cope with the race - page_referenced() is
> sometimes used without page lock, so page_lock_anon_vma() needs an
> ACCESS_ONCE() to be sure of getting anon_vma and flags together
> (no problem if the page goes ksm an instant after, the integrity
> of that anon_vma list is unaffected).
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/ksm.c | 67 ++++++++++++++++------------------------------------
> mm/rmap.c | 6 +++-
> 2 files changed, 25 insertions(+), 48 deletions(-)
>
> --- ksm4/mm/ksm.c 2009-11-22 20:40:18.000000000 +0000
> +++ ksm5/mm/ksm.c 2009-11-22 20:40:27.000000000 +0000
> @@ -831,7 +831,8 @@ out:
> * try_to_merge_one_page - take two pages and merge them into one
> * @vma: the vma that holds the pte pointing to page
> * @page: the PageAnon page that we want to replace with kpage
> - * @kpage: the PageKsm page that we want to map instead of page
> + * @kpage: the PageKsm page that we want to map instead of page,
> + * or NULL the first time when we want to use page as kpage.
> *
> * This function returns 0 if the pages were merged, -EFAULT otherwise.
> */
> @@ -864,15 +865,24 @@ static int try_to_merge_one_page(struct
> * ptes are necessarily already write-protected. But in either
> * case, we need to lock and check page_count is not raised.
> */
> - if (write_protect_page(vma, page, &orig_pte) == 0 &&
> - pages_identical(page, kpage))
> - err = replace_page(vma, page, kpage, orig_pte);
> + if (write_protect_page(vma, page, &orig_pte) == 0) {
> + if (!kpage) {
> + /*
> + * While we hold page lock, upgrade page from
> + * PageAnon+anon_vma to PageKsm+NULL stable_node:
> + * stable_tree_insert() will update stable_node.
> + */
> + set_page_stable_node(page, NULL);
> + mark_page_accessed(page);
> + err = 0;
> + } else if (pages_identical(page, kpage))
> + err = replace_page(vma, page, kpage, orig_pte);
> + }
>
> - if ((vma->vm_flags & VM_LOCKED) && !err) {
> + if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
> munlock_vma_page(page);
> if (!PageMlocked(kpage)) {
> unlock_page(page);
> - lru_add_drain();
Is this related to memcg ?
> lock_page(kpage);
> mlock_vma_page(kpage);
> page = kpage; /* for final unlock */
> @@ -922,7 +932,7 @@ out:
> * This function returns the kpage if we successfully merged two identical
> * pages into one ksm page, NULL otherwise.
> *
> - * Note that this function allocates a new kernel page: if one of the pages
> + * Note that this function upgrades page to ksm page: if one of the pages
> * is already a ksm page, try_to_merge_with_ksm_page should be used.
> */
> static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
> @@ -930,10 +940,7 @@ static struct page *try_to_merge_two_pag
> struct rmap_item *tree_rmap_item,
> struct page *tree_page)
> {
> - struct mm_struct *mm = rmap_item->mm;
> - struct vm_area_struct *vma;
> - struct page *kpage;
> - int err = -EFAULT;
> + int err;
>
> /*
> * The number of nodes in the stable tree
> @@ -943,37 +950,10 @@ static struct page *try_to_merge_two_pag
> ksm_max_kernel_pages <= ksm_pages_shared)
> return NULL;
>
> - kpage = alloc_page(GFP_HIGHUSER);
> - if (!kpage)
> - return NULL;
> -
> - down_read(&mm->mmap_sem);
> - if (ksm_test_exit(mm))
> - goto up;
> - vma = find_vma(mm, rmap_item->address);
> - if (!vma || vma->vm_start > rmap_item->address)
> - goto up;
> -
> - copy_user_highpage(kpage, page, rmap_item->address, vma);
> -
> - SetPageDirty(kpage);
> - __SetPageUptodate(kpage);
> - SetPageSwapBacked(kpage);
> - set_page_stable_node(kpage, NULL); /* mark it PageKsm */
> - lru_cache_add_lru(kpage, LRU_ACTIVE_ANON);
> -
> - err = try_to_merge_one_page(vma, page, kpage);
> - if (err)
> - goto up;
> -
> - /* Must get reference to anon_vma while still holding mmap_sem */
> - hold_anon_vma(rmap_item, vma->anon_vma);
> -up:
> - up_read(&mm->mmap_sem);
> -
> + err = try_to_merge_with_ksm_page(rmap_item, page, NULL);
> if (!err) {
> err = try_to_merge_with_ksm_page(tree_rmap_item,
> - tree_page, kpage);
> + tree_page, page);
> /*
> * If that fails, we have a ksm page with only one pte
> * pointing to it: so break it.
> @@ -981,11 +961,7 @@ up:
> if (err)
> break_cow(rmap_item);
> }
> - if (err) {
> - put_page(kpage);
> - kpage = NULL;
> - }
> - return kpage;
> + return err ? NULL : page;
> }
>
> /*
> @@ -1244,7 +1220,6 @@ static void cmp_and_merge_page(struct pa
> stable_tree_append(rmap_item, stable_node);
> }
> unlock_page(kpage);
> - put_page(kpage);
>
> /*
> * If we fail to insert the page into the stable tree,
> --- ksm4/mm/rmap.c 2009-11-22 20:40:11.000000000 +0000
> +++ ksm5/mm/rmap.c 2009-11-22 20:40:27.000000000 +0000
> @@ -204,7 +204,7 @@ struct anon_vma *page_lock_anon_vma(stru
> unsigned long anon_mapping;
>
> rcu_read_lock();
> - anon_mapping = (unsigned long) page->mapping;
> + anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;
> if (!page_mapped(page))
> @@ -666,7 +666,9 @@ static void __page_check_anon_rmap(struc
> * @address: the user virtual address mapped
> *
> * The caller needs to hold the pte lock, and the page must be locked in
> - * the anon_vma case: to serialize mapping,index checking after setting.
> + * the anon_vma case: to serialize mapping,index checking after setting,
> + * and to ensure that PageAnon is not being upgraded racily to PageKsm
> + * (but PageKsm is never downgraded to PageAnon).
> */
> void page_add_anon_rmap(struct page *page,
> struct vm_area_struct *vma, unsigned long address)
On Tue, 24 Nov 2009 16:51:13 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:
> But ksm swapping does require one small change in mem cgroup handling.
> When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
> substitute a duplicate page to accommodate a different anon_vma (or a
> different index), that page escaped mem cgroup accounting, because of
> the !PageSwapCache check in mem_cgroup_try_charge_swapin().
>
> That was returning success without charging, on the assumption that
> pte_same() would fail after, which is not the case here. Originally I
> proposed that success, so that an unshrinkable mem cgroup at its limit
> would not fail unnecessarily; but that's a minor point, and there are
> plenty of other places where we may fail an overallocation which might
> later prove unnecessary. So just go ahead and do what all the other
> exceptions do: proceed to charge current mm.
>
> Signed-off-by: Hugh Dickins <[email protected]>
Ok. Maybe commit_charge will work enough. (I hope so.)
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
BTW, I'd be happy if you add some "How to test" documentation to
Documentation/vm/ksm.txt, or share some test programs.
1. Map anonymous pages + madvise(MADV_MERGEABLE)
2. "echo 1 > /sys/kernel/mm/ksm/run"
Is that enough?
Thanks,
-Kame
> ---
>
> mm/memcontrol.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> --- ksm5/mm/memcontrol.c 2009-11-14 10:17:02.000000000 +0000
> +++ ksm6/mm/memcontrol.c 2009-11-22 20:40:37.000000000 +0000
> @@ -1862,11 +1862,12 @@ int mem_cgroup_try_charge_swapin(struct
> goto charge_cur_mm;
> /*
> * A racing thread's fault, or swapoff, may have already updated
> - * the pte, and even removed page from swap cache: return success
> - * to go on to do_swap_page()'s pte_same() test, which should fail.
> + * the pte, and even removed page from swap cache: in those cases
> + * do_swap_page()'s pte_same() test will fail; but there's also a
> + * KSM case which does need to charge the page.
> */
> if (!PageSwapCache(page))
> - return 0;
> + goto charge_cur_mm;
> mem = try_get_mem_cgroup_from_swapcache(page);
> if (!mem)
> goto charge_cur_mm;
On Tue, 24 Nov 2009 16:42:15 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:
> +int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
> + unsigned long *vm_flags)
> +{
> + struct stable_node *stable_node;
> + struct rmap_item *rmap_item;
> + struct hlist_node *hlist;
> + unsigned int mapcount = page_mapcount(page);
> + int referenced = 0;
> + struct vm_area_struct *vma;
> +
> + VM_BUG_ON(!PageKsm(page));
> + VM_BUG_ON(!PageLocked(page));
> +
> + stable_node = page_stable_node(page);
> + if (!stable_node)
> + return 0;
> +
Hmm. I'm not sure how many pages are shared in a system but
can't we add some threshold for avoiding too much scanning of shared pages?
(in vmscan.c)
like..
if (page_mapcount(page) > (XXXX >> scan_priority))
return 1;
I saw terrible slow downs in shmem-swap-out in old RHELs (at user support).
(Added kosaki to CC.)
After this patch, the number of shared swappable page will be unlimited.
Thanks,
-Kame
> On Thu, 26 Nov 2009, Mel Gorman wrote:
> > On Tue, Nov 24, 2009 at 04:40:55PM +0000, Hugh Dickins wrote:
> > > When KSM merges an mlocked page, it has been forgetting to munlock it:
> > > that's been left to free_page_mlock(), which reports it in /proc/vmstat
> > > as unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
> > > whinges "Page flag mlocked set for process" in mmotm, whereas mainline
> > > is silently forgiving). Call munlock_vma_page() to fix that.
> > >
> > > Signed-off-by: Hugh Dickins <[email protected]>
> >
> > Acked-by: Mel Gorman <[email protected]>
>
> Rik & Mel, thanks for the Acks.
>
> But please clarify: that patch was for mmotm and hopefully 2.6.33,
> but the vmstat issue (minus warning message) is there in 2.6.32-rc.
> Should I
>
> (a) forget it for 2.6.32
> (b) rush Linus a patch for 2.6.32 final
> (c) send a patch for 2.6.32.stable later on
I personally prefer (c), though I don't know ksm in much detail.
>
> ? I just don't have a feel for how important this is.
>
> Typically, these pages are immediately freed, and the only issue is
> which stats they get added to; but if fork has copied them into other
> mms, then such pages might stay unevictable indefinitely, despite no
> longer being in any mlocked vma.
>
> There's a remark in munlock_vma_page(), apropos a different issue,
> /*
> * We lost the race. let try_to_unmap() deal
> * with it. At least we get the page state and
> * mlock stats right. However, page is still on
> * the noreclaim list. We'll fix that up when
> * the page is eventually freed or we scan the
> * noreclaim list.
> */
> which implies that sometimes we scan the unevictable list and resolve
> such cases. But I wonder if that's nowadays the case?
We don't scan the unevictable list at all. munlock_vma_page()'s logic is:
1) clear PG_mlocked always, anyway
2) isolate the page
3) scan the vmas mapping it and re-mark PG_mlocked if necessary
So, as far as I understand, the above comment describes the case when (2)
fails: it means another task has already isolated the page; that task will
put the page back on an evictable list, and vmscan's try_to_unmap() will
move it to the unevictable list again.
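As a rough illustration of that flow (just a sketch of the logic above,
not the actual mm/mlock.c source; the comments are my annotation):
static void munlock_vma_page_sketch(struct page *page)
{
	if (TestClearPageMlocked(page)) {	/* 1) always clear PG_mlocked */
		dec_zone_page_state(page, NR_MLOCK);
		if (!isolate_lru_page(page)) {	/* 2) try to isolate it */
			/* 3) rescan the vmas mapping it; re-mark if still VM_LOCKED */
			try_to_munlock(page);
			putback_lru_page(page);
		}
		/*
		 * If isolation fails, another task has the page off the LRU;
		 * vmscan's try_to_unmap() will later move it back to the
		 * unevictable list if it is still mlocked.
		 */
	}
}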
> On Tue, 24 Nov 2009 16:42:15 +0000 (GMT)
> Hugh Dickins <[email protected]> wrote:
> > +int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
> > + unsigned long *vm_flags)
> > +{
> > + struct stable_node *stable_node;
> > + struct rmap_item *rmap_item;
> > + struct hlist_node *hlist;
> > + unsigned int mapcount = page_mapcount(page);
> > + int referenced = 0;
> > + struct vm_area_struct *vma;
> > +
> > + VM_BUG_ON(!PageKsm(page));
> > + VM_BUG_ON(!PageLocked(page));
> > +
> > + stable_node = page_stable_node(page);
> > + if (!stable_node)
> > + return 0;
> > +
>
> Hmm. I'm not sure how many pages are shared in a system but
> can't we add some threshold for avoiding too much scanning of shared pages?
> (in vmscan.c)
> like..
>
> if (page_mapcount(page) > (XXXX >> scan_priority))
> return 1;
>
> I saw terrible slow downs in shmem-swap-out in old RHELs (at user support).
> (Added kosaki to CC.)
>
> After this patch, the number of shared swappable page will be unlimited.
Probably, it doesn't matter. I mean:
- KSM sharing and Shmem sharing have almost the same performance characteristics.
- if memory pressure is low, the SplitLRU VM doesn't scan the anon list so much.
If ksm swap is too costly, we need to improve anon list scanning generically.
btw, I'm not sure why the below kmem_cache_zalloc() is necessary. Why can't we
use the stack?
----------------------------
+ /*
+ * Temporary hack: really we need anon_vma in rmap_item, to
+ * provide the correct vma, and to find recently forked instances.
+ * Use zalloc to avoid weirdness if any other fields are involved.
+ */
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_ATOMIC);
+ if (!vma) {
+ spin_lock(&ksm_fallback_vma_lock);
+ vma = &ksm_fallback_vma;
+ }
On Mon, 30 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> Sorry for delayed response.
No, thank you very much for spending your time on it.
>
> On Tue, 24 Nov 2009 16:48:46 +0000 (GMT)
> Hugh Dickins <[email protected]> wrote:
>
> > When ksm pages were unswappable, it made no sense to include them in
> > mem cgroup accounting; but now that they are swappable (although I see
> > no strict logical connection)
> I asked that for throwing away too complicated but wast of time things.
I'm sorry, I didn't understand that sentence at all!
> If not on LRU, its own limitation (ksm's page limit) works enough.
Yes, I think it made sense the way it was before when unswappable,
but that once they're swappable and that limitation is removed,
they do then need to participate in mem cgroup accounting.
I _think_ you're agreeing, but I'm not quite sure!
>
> > the principle of least surprise implies
> > that they should be accounted (with the usual dissatisfaction, that a
> > shared page is accounted to only one of the cgroups using it).
> >
> > This patch was intended to add mem cgroup accounting where necessary;
> > but turned inside out, it now avoids allocating a ksm page, instead
> > upgrading an anon page to ksm - which brings its existing mem cgroup
> > accounting with it. Thus mem cgroups don't appear in the patch at all.
> >
> ok. then, what I should see is patch 6.
Well, that doesn't have much in it either. It should all be
happening naturally, from using the page that's already accounted.
> > @@ -864,15 +865,24 @@ static int try_to_merge_one_page(struct
...
> >
> > - if ((vma->vm_flags & VM_LOCKED) && !err) {
> > + if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
> > munlock_vma_page(page);
> > if (!PageMlocked(kpage)) {
> > unlock_page(page);
> > - lru_add_drain();
>
> Is this related to memcg ?
>
> > lock_page(kpage);
> > mlock_vma_page(kpage);
Is the removal of lru_add_drain() related to memcg? No, or only to
the extent that reusing the original anon page is related to memcg.
I put lru_add_drain() in there before, because (for one of the calls
to try_to_merge_one_page) the kpage had just been allocated an instant
before, with lru_cache_add_lru putting it into the per-cpu array, so
in that case mlock_vma_page(kpage) would need an lru_add_drain() to
find it on the LRU (of course, we might be preempted to a different
cpu in between, and lru_add_drain not be enough: but I think we've
all come to the conclusion that lru_add_drain_all should be avoided
unless there's a very strong reason for it).
But with this patch we're reusing the existing anon page as ksm page,
and we know that it's been in place for at least one circuit of ksmd
(ignoring coincidences like the jhash of the page happens to be 0),
so we've every reason to believe that it will already be on its LRU:
no need for lru_add_drain().
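In other words, the old sequence (condensed from the code this patch
removes; the comments here are just my annotation) was:
	kpage = alloc_page(GFP_HIGHUSER);
	...
	lru_cache_add_lru(kpage, LRU_ACTIVE_ANON);	/* parks kpage in a per-cpu pagevec */
	...
	lru_add_drain();	/* flush that pagevec so kpage really is on the LRU */
	lock_page(kpage);
	mlock_vma_page(kpage);	/* isolates kpage from the LRU to move it to the unevictable list */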
Hugh
On Mon, 30 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> Ok. Maybe commit_charge will work enough. (I hope so.)
Me too.
>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>
> BTW, I'd be happy if you add some "How to test" documentation to
> Documentation/vm/ksm.txt, or share some test programs.
>
> 1. Map anonymous pages + madvise(MADV_MERGEABLE)
> 2. "echo 1 > /sys/kernel/mm/ksm/run"
Those are the main points, yes.
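For what it's worth, a minimal test program along those lines need be no
more than this (illustrative only, not something from the series; adjust
NPAGES to taste):
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define NPAGES	1000

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t len = NPAGES * pagesize;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 0x5a, len);			/* identical content in every page */
	if (madvise(p, len, MADV_MERGEABLE))	/* needs CONFIG_KSM in the kernel */
		perror("madvise");
	/* now "echo 1 > /sys/kernel/mm/ksm/run" and watch pages_shared grow */
	pause();
	return 0;
}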
Though in testing for races, I do think the default
/sys/kernel/mm/ksm/sleep_millisecs 20 is probably too relaxed
to find issues quickly enough, so I usually change that to 0,
and also raise pages_to_scan from its default of 100 (though
that should matter less). In testing for races, something I've
not done but probably should, is raise the priority of ksmd: we
have it niced down, but that may leave some nasties unobserved.
As to adding Documentation on testing: whilst my primary reason
for not doing so is certainly laziness (or an interest in moving
on to somewhere else), a secondary reason is that I'd much rather
that if someone does have an interest in testing this, that they
follow their own ideas, rather than copying what's already done.
But here's something I'll share with you, please don't show it
to anyone else ;) Writing test programs using MADV_MERGEABLE is
good for testing specific issues, but can't give much coverage,
so I tend to run with this hack below: boot option "allksm" makes
as much as it can MADV_MERGEABLE. (If you wonder why I squashed it
up, it was to avoid changing the line numbering as much as possible.)
Hugh
--- mmotm/mm/mmap.c 2009-11-25 09:28:50.000000000 +0000
+++ allksm/mm/mmap.c 2009-11-25 11:19:13.000000000 +0000
@@ -902,9 +902,9 @@ void vm_stat_account(struct mm_struct *m
#endif /* CONFIG_PROC_FS */
/*
- * The caller must hold down_write(&current->mm->mmap_sem).
- */
-
+ * The caller must hold down_write(&current->mm->mmap_sem). */
+#include <linux/ksm.h>
+unsigned long vm_mergeable;
unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flags, unsigned long pgoff)
@@ -1050,7 +1050,7 @@ unsigned long do_mmap_pgoff(struct file
/*
* Set pgoff according to addr for anon_vma.
*/
- pgoff = addr >> PAGE_SHIFT;
+ vm_flags |= vm_mergeable; pgoff = addr >> PAGE_SHIFT;
break;
default:
return -EINVAL;
@@ -1201,10 +1201,10 @@ munmap_back:
vma->vm_file = file;
get_file(file);
error = file->f_op->mmap(file, vma);
- if (error)
- goto unmap_and_free_vma;
- if (vm_flags & VM_EXECUTABLE)
- added_exe_file_vma(mm);
+ if (error) goto unmap_and_free_vma;
+ if (vm_flags & VM_EXECUTABLE) added_exe_file_vma(mm);
+ if (vm_mergeable)
+ ksm_madvise(vma, 0, 0, MADV_MERGEABLE,&vma->vm_flags);
/* Can addr have changed??
*
@@ -2030,7 +2030,7 @@ unsigned long do_brk(unsigned long addr,
return error;
flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
-
+ flags |= vm_mergeable;
error = arch_mmap_check(addr, len, flags);
if (error)
return error;
@@ -2179,7 +2179,7 @@ int insert_vm_struct(struct mm_struct *
if (!vma->vm_file) {
BUG_ON(vma->anon_vma);
vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
- }
+ vma->vm_flags |= vm_mergeable; }
__vma = find_vma_prepare(mm,vma->vm_start,&prev,&rb_link,&rb_parent);
if (__vma && __vma->vm_start < vma->vm_end)
return -ENOMEM;
@@ -2518,3 +2518,10 @@ void __init mmap_init(void)
ret = percpu_counter_init(&vm_committed_as, 0);
VM_BUG_ON(ret);
}
+static int __init allksm(char *s)
+{
+ randomize_va_space = 0;
+ vm_mergeable = VM_MERGEABLE;
+ return 1;
+}
+__setup("allksm", allksm);
On Mon, 30 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> Hmm. I'm not sure how many pages are shared in a system but
> can't we add some threshold for avoiding too much scanning of shared pages?
> (in vmscan.c)
> like..
>
> if (page_mapcount(page) > (XXXX >> scan_priority))
> return 1;
>
> I saw terrible slow downs in shmem-swap-out in old RHELs (at user support).
> (Added kosaki to CC.)
>
> After this patch, the number of shared swappable page will be unlimited.
I don't think KSM swapping changes the story here at all: I don't
think it significantly increases the likelihood of pages with very
high mapcounts on the LRUs. You've met the issue with shmem, okay,
I've always thought shared library text pages would be a problem.
I've often thought that some kind of "don't bother if the mapcount is
too high" check in vmscan.c might help - though I don't think I've
ever noticed the bugreport it would help with ;)
I used to imagine doing up to a certain number inside the rmap loops
and then breaking out (that would help with those reports of huge
anon_vma lists); but that would involve starting the next time from
where we left off, which would be difficult with the prio_tree.
Your proposal above (adjusting the limit according to scan_priority,
yes that's important) looks very promising to me.
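Something like this is what I read into it (entirely hypothetical, just
to make the shape concrete; MAPCOUNT_SCAN_CUTOFF is a made-up tunable and
the right place for the check would need real thought):
#define MAPCOUNT_SCAN_CUTOFF	262144	/* purely illustrative: 64 at DEF_PRIORITY 12 */

static inline int mapcount_over_scan_limit(struct page *page, int priority)
{
	/*
	 * Skip the expensive rmap walk for very widely mapped pages: the
	 * cutoff grows as reclaim priority counts down under pressure, so
	 * such pages are protected while pressure is light, but still
	 * reclaimable when it becomes severe.
	 */
	return page_mapcount(page) > (MAPCOUNT_SCAN_CUTOFF >> priority);
}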
Hugh
On Mon, Nov 30, 2009 at 09:46:16AM +0900, KAMEZAWA Hiroyuki wrote:
> Hmm. I'm not sure how many pages are shared in a system but
> can't we add some threshold for avoiding too much scanning of shared pages?
> (in vmscan.c)
> like..
>
> if (page_mapcount(page) > (XXXX >> scan_priority))
> return 1;
>
> I saw terrible slow downs in shmem-swap-out in old RHELs (at user support).
> (Added kosaki to CC.)
If those ptes are all old there's no reason to keep those pages in ram
any longer... I don't like those magic number levels. If you saw slowdowns
it'd be interesting to get more information on those workloads. I
have never seen swap-out workloads in real life that are not 99% I/O
dominated, there's nothing that loads the cpu anything close to 100%,
so nothing that a magic check like above could affect. Besides tmpfs
unmap methods are different from ksm and anon pages unmap methods, and
certain locks are coarser if there's userland taking i_mmap_lock for
I/O during paging.
> After this patch, the number of shared swappable pages will be unlimited.
It is unlimited even without ksm: tmpfs may be limited, but it's not
as if we stop fork from sharing at some point; and anon_vma is less
fine-grained than rmap_item, and it can also include in its list vmas not
mapping the page at all, after mremap/munmap has partially truncated
the copied/shared vma.
On Mon, 30 Nov 2009, KOSAKI Motohiro wrote:
> >
> > But please clarify: that patch was for mmotm and hopefully 2.6.33,
> > but the vmstat issue (minus warning message) is there in 2.6.32-rc.
> > Should I
> >
> > (a) forget it for 2.6.32
> > (b) rush Linus a patch for 2.6.32 final
> > (c) send a patch for 2.6.32.stable later on
>
> I personally prefer (c), though I don't know ksm in that much detail.
Thanks, I think that would be my preference by now too.
> > There's a remark in munlock_vma_page(), apropos a different issue,
> > /*
> > * We lost the race. let try_to_unmap() deal
> > * with it. At least we get the page state and
> > * mlock stats right. However, page is still on
> > * the noreclaim list. We'll fix that up when
> > * the page is eventually freed or we scan the
> > * noreclaim list.
> > */
> > which implies that sometimes we scan the unevictable list and resolve
> > such cases. But I wonder if that's nowadays the case?
>
> We don't scan the unevictable list at all. munlock_vma_page()'s logic is:
>
> 1) clear PG_mlock always anyway
> 2) isolate page
> 3) scan related vma and remark PG_mlock if necessary
>
> So, as far as I understand, the above comment describes the case when (2)
> fails. It means another task has already isolated the page; that task puts
> the page back on the evictable list, and vmscan's try_to_unmap() moves
> the page to the unevictable list again.
That is the case it's addressing, yes; but both references to
"the noreclaim list" are untrue and misleading (now: they may well
have been accurate when the comment went in). I'd like to correct
it, but cannot do so without spending the time to make sure that
what I'm saying instead isn't equally misleading...
Even "We lost the race" is worrying: which race? there might be several.
Hugh
On Mon, 30 Nov 2009, KOSAKI Motohiro wrote:
> > After this patch, the number of shared swappable pages will be unlimited.
>
> Probably, it doesn't matter. I mean
>
> - KSM sharing and shmem sharing have almost the same performance characteristics.
> - if memory pressure is low, the SplitLRU VM doesn't scan the anon list so much.
>
> if ksm swap is too costly, we need to improve anon list scanning generically.
Yes, we're in agreement that this issue is not new with KSM swapping.
> btw, I'm not sure why the kmem_cache_zalloc() below is necessary. Why can't we
> use the stack?
Well, I didn't use stack: partly because I'm so ashamed of the pseudo-vmas
on the stack in mm/shmem.c, which have put shmem_getpage() into reports
of high stack users (I've unfinished patches to deal with that); and
partly because page_referenced_ksm() and try_to_unmap_ksm() are on
the page reclaim path, maybe way down deep on a very deep stack.
But it's not something you or I should be worrying about: as the comment
says, this is just a temporary hack, to present a patch which gets KSM
swapping working in an understandable way, while leaving some corrections
and refinements to subsequent patches. This pseudo-vma is removed in the
very next patch.
Hugh
>
> ----------------------------
> + /*
> + * Temporary hack: really we need anon_vma in rmap_item, to
> + * provide the correct vma, and to find recently forked instances.
> + * Use zalloc to avoid weirdness if any other fields are involved.
> + */
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_ATOMIC);
> + if (!vma) {
> + spin_lock(&ksm_fallback_vma_lock);
> + vma = &ksm_fallback_vma;
> + }
On Mon, 2009-11-30 at 12:26 +0000, Hugh Dickins wrote:
> On Mon, 30 Nov 2009, KOSAKI Motohiro wrote:
> > >
> > > But please clarify: that patch was for mmotm and hopefully 2.6.33,
> > > but the vmstat issue (minus warning message) is there in 2.6.32-rc.
> > > Should I
> > >
> > > (a) forget it for 2.6.32
> > > (b) rush Linus a patch for 2.6.32 final
> > > (c) send a patch for 2.6.32.stable later on
> >
> > I personally prefer (c), though I don't know ksm in that much detail.
>
> Thanks, I think that would be my preference by now too.
>
> > > There's a remark in munlock_vma_page(), apropos a different issue,
> > > /*
> > > * We lost the race. let try_to_unmap() deal
> > > * with it. At least we get the page state and
> > > * mlock stats right. However, page is still on
> > > * the noreclaim list. We'll fix that up when
> > > * the page is eventually freed or we scan the
> > > * noreclaim list.
> > > */
> > > which implies that sometimes we scan the unevictable list and resolve
> > > such cases. But I wonder if that's nowadays the case?
> >
> > We don't scan the unevictable list at all. munlock_vma_page()'s logic is:
> >
> > 1) clear PG_mlock always anyway
> > 2) isolate page
> > 3) scan related vma and remark PG_mlock if necessary
> >
> > So, as far as I understand, the above comment describes the case when (2)
> > fails. It means another task has already isolated the page; that task puts
> > the page back on the evictable list, and vmscan's try_to_unmap() moves
> > the page to the unevictable list again.
>
> That is the case it's addressing, yes; but both references to
> "the noreclaim list" are untrue and misleading (now: they may well
> have been accurate when the comment went in). I'd like to correct
> it, but cannot do so without spending the time to make sure that
> what I'm saying instead isn't equally misleading...
>
> Even "We lost the race" is worrying: which race? there might be several.
I agree that this is likely a stale comment. At the time I wrote it,
putback_lru_page() didn't recheck whether the page was reclaimable [now
"evictable"]. isolate_lru_page() preserves the lru state flags Active
and Unevictable; at that time putback_lru_page() just put the page back on the
list those flags indicated.
"The race" referred to the "isolation race" discussed in the comment
block on munlock_vma_page().
Had we been munlock()ing or munmap()ing the last VMA holding the page
mlocked, we should take it off the unevictable list. But we need to
isolate the page to move it between lists, or even to call
try_to_munlock() to check whether there are other vmas holding the page
mlocked. If we were unable to isolate the page in munlock_vma_page()
and it were "putback" by whatever was holding it [page migration
maybe?], it would go back onto the unevictable list where it would be
stranded.
Now that we recheck the page state in putback_lru_page(), this shouldn't
be an issue. We've already cleared the Mlock page flag, so that
condition won't force it onto the unevictable list.
Even the part about try_to_unmap() dealing with it is stale. Now,
vmscan detects VM_LOCKED pages in page_referenced() before it gets to
try_to_unmap(). The function comment block needs updating as well. If
no one beats me to it, I'll post a cleanup patch for consideration
shortly.
Lee
>
> Hugh
>
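To keep the three steps above straight, here is the flow paraphrased as code;
a simplified sketch of what munlock_vma_page() does in this era, not the
literal mm/mlock.c source:

static void munlock_vma_page_sketch(struct page *page)
{
	BUG_ON(!PageLocked(page));

	if (TestClearPageMlocked(page)) {	/* 1) clear PG_mlocked anyway */
		dec_zone_page_state(page, NR_MLOCK);
		if (!isolate_lru_page(page)) {	/* 2) isolate the page */
			/*
			 * 3) scan the vmas mapping the page: try_to_munlock()
			 * re-marks it PG_mlocked if another VM_LOCKED vma
			 * still maps it, and putback_lru_page() re-evaluates
			 * which LRU list (evictable or not) it belongs on.
			 */
			try_to_munlock(page);
			putback_lru_page(page);
		}
		/*
		 * else: we lost the isolation race.  Whoever isolated the
		 * page will putback_lru_page() it; since PG_mlocked is
		 * already clear, the page is no longer forced onto the
		 * unevictable list -- the point Lee makes above about the
		 * stale comment.
		 */
	}
}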
On Mon, 30 Nov 2009 11:18:51 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:
> On Mon, 30 Nov 2009, KAMEZAWA Hiroyuki wrote:
> >
> > Sorry for delayed response.
>
> No, thank you very much for spending your time on it.
>
> >
> > On Tue, 24 Nov 2009 16:48:46 +0000 (GMT)
> > Hugh Dickins <[email protected]> wrote:
> >
> > > When ksm pages were unswappable, it made no sense to include them in
> > > mem cgroup accounting; but now that they are swappable (although I see
> > > no strict logical connection)
> > I asked that for throwing away too complicated but wast of time things.
>
> I'm sorry, I didn't understand that sentence at all!
>
Sorry. At the time of ksm's implementation, I didn't want to consider how to account it,
because there were problems around swap accounting. So I asked ksm to
limit its usage by itself.
> > If not on LRU, its own limitation (ksm's page limit) works enough.
>
> Yes, I think it made sense the way it was before when unswappable,
> but that once they're swappable and that limitation is removed,
> they do then need to participate in mem cgroup accounting.
>
> I _think_ you're agreeing, but I'm not quite sure!
>
I agree. No objections.
> > > @@ -864,15 +865,24 @@ static int try_to_merge_one_page(struct
> ...
> > >
> > > - if ((vma->vm_flags & VM_LOCKED) && !err) {
> > > + if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
> > > munlock_vma_page(page);
> > > if (!PageMlocked(kpage)) {
> > > unlock_page(page);
> > > - lru_add_drain();
> >
> > Is this related to memcg ?
> >
> > > lock_page(kpage);
> > > mlock_vma_page(kpage);
>
> Is the removal of lru_add_drain() related to memcg? No, or only to
> the extent that reusing the original anon page is related to memcg.
>
> I put lru_add_drain() in there before, because (for one of the calls
> to try_to_merge_one_page) the kpage had just been allocated an instant
> before, with lru_cache_add_lru putting it into the per-cpu array, so
> in that case mlock_vma_page(kpage) would need an lru_add_drain() to
> find it on the LRU (of course, we might be preempted to a different
> cpu in between, and lru_add_drain not be enough: but I think we've
> all come to the conclusion that lru_add_drain_all should be avoided
> unless there's a very strong reason for it).
>
> But with this patch we're reusing the existing anon page as ksm page,
> and we know that it's been in place for at least one circuit of ksmd
> (ignoring coincidences like the jhash of the page happens to be 0),
> so we've every reason to believe that it will already be on its LRU:
> no need for lru_add_drain().
>
Thank you for clarification.
Regards,
-Kame
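For anyone puzzling over that removed lru_add_drain(): the point is that a
page freshly added through the per-cpu pagevec is not yet on the zone LRU, so
the isolation done inside mlock_vma_page() cannot find it until the pagevec is
flushed. A schematic fragment (the function names are real, the sequence is
only illustrative of the old kpage-allocation path):

	/* kpage was just allocated; it sits in this cpu's pagevec for now */
	lru_cache_add_lru(kpage, LRU_ACTIVE_ANON);
	/* ... */
	lock_page(kpage);
	lru_add_drain();	/* flush the pagevec so kpage reaches its LRU */
	mlock_vma_page(kpage);	/* the isolate_lru_page() inside can now work */
	unlock_page(kpage);

With the patch reusing an existing anon page that has been in place for at
least one ksmd pass, the page can be assumed to be on its LRU already, which
is why the drain could go.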
On Mon, 30 Nov 2009 13:07:05 +0100
Andrea Arcangeli <[email protected]> wrote:
> On Mon, Nov 30, 2009 at 09:46:16AM +0900, KAMEZAWA Hiroyuki wrote:
> > Hmm. I'm not sure how many pages are shared in a system but
> > can't we add some threshold for avoiding too much scanning of shared pages?
> > (in vmscan.c)
> > like..
> >
> > if (page_mapcount(page) > (XXXX >> scan_priority))
> > return 1;
> >
> > I saw terrible slow downs in shmem-swap-out in old RHELs (at user support).
> > (Added kosaki to CC.)
>
> If those ptes are all old there's no reason to keep those pages in ram
> any longer... I don't like those magic number levels. If you saw slowdowns
> it'd be interesting to get more information on those workloads. I
> have never seen swap-out workloads in real life that are not 99% I/O
> dominated, there's nothing that loads the cpu anything close to 100%,
> so nothing that a magic check like above could affect.
I saw a user incident where all 64 cpus hung on shmem's spinlock and got
a great slowdown and a cluster failover.
As a workaround, we recommended they use hugepages, which are not scanned.
Hmm. Can KSM coalesce 10000+ pages into a single page? In such a case, the lru scan
needs to walk 10000+ ptes, taking 10000+ anon_vma->locks and 10000+ pte locks,
to reclaim a single page.
> Besides tmpfs
> unmap methods are different from ksm and anon pages unmap methods, and
> certain locks are coarser if there's userland taking i_mmap_lock for
> I/O during paging.
>
maybe.
Hmm, Larry Woodman reports another? issue.
http://marc.info/?l=linux-mm&m=125961823921743&w=2
Maybe some modification to lru scanning is necessary independent from ksm.
I think.
Thanks,
-Kame
> > btw, I'm not sure why the kmem_cache_zalloc() below is necessary. Why can't we
> > use the stack?
>
> Well, I didn't use stack: partly because I'm so ashamed of the pseudo-vmas
> on the stack in mm/shmem.c, which have put shmem_getpage() into reports
> of high stack users (I've unfinished patches to deal with that); and
> partly because page_referenced_ksm() and try_to_unmap_ksm() are on
> the page reclaim path, maybe way down deep on a very deep stack.
>
> But it's not something you or I should be worrying about: as the comment
> says, this is just a temporary hack, to present a patch which gets KSM
> swapping working in an understandable way, while leaving some corrections
> and refinements to subsequent patches. This pseudo-vma is removed in the
> very next patch.
I see. Thanks for the kind explanation :)
* KAMEZAWA Hiroyuki ([email protected]) wrote:
> Hmm. Can KSM coalesce 10000+ pages into a single page?
Yes. The zero page is a prime example of this.
> In such a case, the lru scan
> needs to walk 10000+ ptes, taking 10000+ anon_vma->locks and 10000+ pte locks,
> to reclaim a single page.
Would likely be a poor choice too. With so many references it's likely
to be touched soon and swapped right back in.
thanks,
-chris
On Tue, Dec 01, 2009 at 09:39:45AM +0900, KAMEZAWA Hiroyuki wrote:
> Maybe some modification to lru scanning is necessary independent from ksm.
> I think.
It looks independent from ksm, yes. Larry's case especially has cpus
hanging in fork, and for those cpus to make progress it'd be enough to
release the anon_vma lock for a little while. I think counting the
number of young bits we found might be enough to fix this (at least
for anon_vma where we can easily randomize the ptes we scan). Let's
just break the rmap loop of page_referenced() after we cleared N young
bits. If we found so many young bits it's pointless to continue. It
still looks preferable to doing nothing or a full scan depending on
a magic mapcount value. It's preferable because we'll do real work
incrementally and we give a chance to heavily mapped but totally
unused pages to go away in perfect lru order.
Sure we can still end up with a 10000 length of anon_vma chain (or
rmap_item chain, or prio_tree scan) with all N young bits set in the
very last N vmas we check. But statistically with so many mappings
such a scenario has a very low probability to materialize. It's not
very useful to be so aggressive on a page where the young bits are
refreshed quick all the time because of plenty of mappings and many of
them using the page. If we do this, we've also to rotate the anon_vma
list too to start from a new vma, which globally it means randomizing
it. For anon_vma (and conceptually for ksm rmap_item, not sure in
implementation terms) it's trivial to rotate to randomize the young
bit scan. For prio_tree (that includes tmpfs) it's much harder.
In addition to returning 1 every N young bit cleared, we should
ideally also have a spin_needbreak() for the rmap lock so things like
fork can continue against page_referenced_one and try_to_unmap
too. Even for the prio_tree we could record the prio_tree position on
the stack and we can add a bit that signals when the prio_tree got
modified under us. But if the rmap structure is modified from under us
we're in deep trouble: after that we have to either restart from
scratch (risking a livelock in page_referenced(), so not really
feasible) or alternatively to return 1 breaking the loop which would
make the VM less reliable (which means we would be increasing the
probability of a spurious OOM). Somebody could just mmap the hugely
mapped file from another task in a loop, and prevent the
page_referenced_one and try_to_unmap to ever complete on all pages of
that file! So I don't really know how to implement the spin_needbreak
without making the VM exploitable. But I'm quite confident there is no
way the below can make the VM less reliable, and the spin_needbreak is
much less relevant for anon_vma than it is for prio_tree because it's
trivial to randomize the ptes we scan for young bit with
anon_vma. Maybe this also is enough to fix tmpfs.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
#include "internal.h"
+#define MAX_YOUNG_BIT_CLEARED 64
+
static struct kmem_cache *anon_vma_cachep;
static inline struct anon_vma *anon_vma_alloc(void)
@@ -420,6 +422,24 @@ static int page_referenced_anon(struct p
&mapcount, vm_flags);
if (!mapcount)
break;
+
+ /*
+ * Break the loop early if we found many active
+ * mappings and go deep into the long chain only if
+ * this looks a fully unused page. Otherwise we only
+ * waste this cpu and we hang other CPUs too that
+ * might be waiting on our lock to be released.
+ */
+ if (referenced >= MAX_YOUNG_BIT_CLEARED) {
+ /*
+ * randomize the MAX_YOUNG_BIT_CLEARED ptes
+ * that we scan at every page_referenced_one()
+ * call on this page.
+ */
+ list_del(&anon_vma->head);
+ list_add(&anon_vma->head, &vma->anon_vma_node);
+ break;
+ }
}
page_unlock_anon_vma(anon_vma);
@@ -485,6 +505,16 @@ static int page_referenced_file(struct p
&mapcount, vm_flags);
if (!mapcount)
break;
+
+ /*
+ * Break the loop early if we found many active
+ * mappings and go deep into the long chain only if
+ * this looks a fully unused page. Otherwise we only
+ * waste this cpu and we hang other CPUs too that
+ * might be waiting on our lock to be released.
+ */
+ if (referenced >= MAX_YOUNG_BIT_CLEARED)
+ break;
}
spin_unlock(&mapping->i_mmap_lock);
> On Tue, Dec 01, 2009 at 09:39:45AM +0900, KAMEZAWA Hiroyuki wrote:
> > Maybe some modification to lru scanning is necessary independent from ksm.
> > I think.
>
> It looks independent from ksm, yes. Larry's case especially has cpus
> hanging in fork, and for those cpus to make progress it'd be enough to
> release the anon_vma lock for a little while. I think counting the
> number of young bits we found might be enough to fix this (at least
> for anon_vma where we can easily randomize the ptes we scan). Let's
> just break the rmap loop of page_referenced() after we cleared N young
> bits. If we found so many young bits it's pointless to continue. It
> still looks preferable to doing nothing or a full scan depending on
> a magic mapcount value. It's preferable because we'll do real work
> incrementally and we give a chance to heavily mapped but totally
> unused pages to go away in perfect lru order.
>
> Sure we can still end up with a 10000 length of anon_vma chain (or
> rmap_item chain, or prio_tree scan) with all N young bits set in the
> very last N vmas we check. But statistically with so many mappings
> such a scenario has a very low probability to materialize. It's not
> very useful to be so aggressive on a page where the young bits are
> refreshed quick all the time because of plenty of mappings and many of
> them using the page. If we do this, we've also to rotate the anon_vma
> list too to start from a new vma, which globally it means randomizing
> it. For anon_vma (and conceptually for ksm rmap_item, not sure in
> implementation terms) it's trivial to rotate to randomize the young
> bit scan. For prio_tree (that includes tmpfs) it's much harder.
>
> In addition to returning 1 every N young bit cleared, we should
> ideally also have a spin_needbreak() for the rmap lock so things like
> fork can continue against page_referenced_one and try_to_unmap
> too. Even for the prio_tree we could record the prio_tree position on
> the stack and we can add a bit that signals when the prio_tree got
> modified under us. But if the rmap structure is modified from under us
> we're in deep trouble: after that we have to either restart from
> scratch (risking a livelock in page_referenced(), so not really
> feasible) or alternatively to return 1 breaking the loop which would
> make the VM less reliable (which means we would be increasing the
> probability of a spurious OOM). Somebody could just mmap the hugely
> mapped file from another task in a loop, and prevent the
> page_referenced_one and try_to_unmap to ever complete on all pages of
> that file! So I don't really know how to implement the spin_needbreak
> without making the VM exploitable. But I'm quite confident there is no
> way the below can make the VM less reliable, and the spin_needbreak is
> much less relevant for anon_vma than it is for prio_tree because it's
> trivial to randomize the ptes we scan for young bit with
> anon_vma. Maybe this also is enough to fix tmpfs.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -60,6 +60,8 @@
>
> #include "internal.h"
>
> +#define MAX_YOUNG_BIT_CLEARED 64
> +
> static struct kmem_cache *anon_vma_cachep;
>
> static inline struct anon_vma *anon_vma_alloc(void)
> @@ -420,6 +422,24 @@ static int page_referenced_anon(struct p
> &mapcount, vm_flags);
> if (!mapcount)
> break;
> +
> + /*
> + * Break the loop early if we found many active
> + * mappings and go deep into the long chain only if
> + * this looks a fully unused page. Otherwise we only
> + * waste this cpu and we hang other CPUs too that
> + * might be waiting on our lock to be released.
> + */
> + if (referenced >= MAX_YOUNG_BIT_CLEARED) {
> + /*
> + * randomize the MAX_YOUNG_BIT_CLEARED ptes
> + * that we scan at every page_referenced_one()
> + * call on this page.
> + */
> + list_del(&anon_vma->head);
> + list_add(&anon_vma->head, &vma->anon_vma_node);
> + break;
> + }
> }
This patch doesn't work correctly. shrink_active_list() uses page_referenced() to
clear the young bits and doesn't use the return value.
After this patch is applied, shrink_active_list() moves the page to the inactive list although
the page still has many young bits set. Then the next shrink_inactive_list() moves the page
back to the active list again.
> page_unlock_anon_vma(anon_vma);
> @@ -485,6 +505,16 @@ static int page_referenced_file(struct p
> &mapcount, vm_flags);
> if (!mapcount)
> break;
> +
> + /*
> + * Break the loop early if we found many active
> + * mappings and go deep into the long chain only if
> + * this looks a fully unused page. Otherwise we only
> + * waste this cpu and we hang other CPUs too that
> + * might be waiting on our lock to be released.
> + */
> + if (referenced >= MAX_YOUNG_BIT_CLEARED)
> + break;
> }
>
> spin_unlock(&mapping->i_mmap_lock);
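For readers without mm/vmscan.c to hand, the pattern being pointed at looks
roughly like the fragment below in this era's shrink_active_list() (simplified
from memory, not a verbatim quote): the young-bit information from
page_referenced() only feeds statistics and the VM_EXEC file-page special
case, and the page is deactivated regardless.

	while (!list_empty(&l_hold)) {
		page = lru_to_page(&l_hold);
		list_del(&page->lru);

		if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
			nr_rotated++;
			/* executable file pages get another trip round the active list */
			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
				list_add(&page->lru, &l_active);
				continue;
			}
		}

		ClearPageActive(page);	/* we are de-activating */
		list_add(&page->lru, &l_inactive);
	}

So breaking out of the rmap loop early only changes how many young bits are
cleared before a deactivation that was going to happen anyway, which is the
objection being raised.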
On Tue, Dec 01, 2009 at 06:28:16PM +0900, KOSAKI Motohiro wrote:
> This patch doesn't work correctly. shrink_active_list() uses page_referenced() to
> clear the young bits and doesn't use the return value.
The whole point is that it's inefficient to clear all young bits just
to move it to inactive list in the hope that new young bits will be
set right before the page reaches the end of the inactive list.
> After this patch is applied, shrink_active_list() moves the page to the inactive list although
> the page still has many young bits set. Then the next shrink_inactive_list() moves the page
> back to the active list again.
Yes, it's not the end of the world; this only alters behavior for pages
that have plenty of mappings. However I still think it's inefficient to
pretend to clear all young bits at once when the page is deactivated. But
this is not something I'm interested in arguing about... do what you
like there, but as long as you pretend to clear all dirty bits there
is no way we can fix anything. Plus we should touch ptes only in
presence of heavy memory pressure, with light memory pressure ptes
should _never_ be touched, and we should only shrink unmapped
cache. And active/inactive movements must still happen even in
presence of light memory pressure. The reason is that with light
memory pressure we're not I/O bound and we don't want to waste time
there. My patch is ok, what is not ok is the rest, you got to change
the rest to deal with this.
> On Tue, Dec 01, 2009 at 06:28:16PM +0900, KOSAKI Motohiro wrote:
> > This patch doesn't work correctly. shrink_active_list() uses page_referenced() to
> > clear the young bits and doesn't use the return value.
>
> The whole point is that it's inefficient to clear all young bits just
> to move it to inactive list in the hope that new young bits will be
> set right before the page reaches the end of the inactive list.
>
> > After this patch is applied, shrink_active_list() moves the page to the inactive list although
> > the page still has many young bits set. Then the next shrink_inactive_list() moves the page
> > back to the active list again.
>
> Yes, it's not the end of the world; this only alters behavior for pages
> that have plenty of mappings. However I still think it's inefficient to
> pretend to clear all young bits at once when the page is deactivated. But
> this is not something I'm interested in arguing about... do what you
> like there, but as long as you pretend to clear all dirty bits there
> is no way we can fix anything. Plus we should touch ptes only in
> presence of heavy memory pressure, with light memory pressure ptes
> should _never_ be touched, and we should only shrink unmapped
> cache. And active/inactive movements must still happen even in
> presence of light memory pressure. The reason is that with light
> memory pressure we're not I/O bound and we don't want to waste time
> there. My patch is ok, what is not ok is the rest, you got to change
> the rest to deal with this.
Ah, well, please wait a bit. I'm reviewing Larry's patch. I don't
dislike your idea; my last mail only pointed out an implementation issue.
On Tue, Dec 01, 2009 at 06:46:06PM +0900, KOSAKI Motohiro wrote:
> Ah, well, please wait a bit. I'm reviewing Larry's patch. I don't
> dislike your idea; my last mail only pointed out an implementation issue.
Yep thanks for pointing it out. It's an implementation thing I don't
like. The VM should not ever touch ptes when there's light VM pressure
and plenty of unmapped clean cache available, but I'm ok if others
disagree and want to keep it that way.
On Fri, Nov 27, 2009 at 12:45:04PM +0000, Hugh Dickins wrote:
> On Thu, 26 Nov 2009, Mel Gorman wrote:
> > On Tue, Nov 24, 2009 at 04:40:55PM +0000, Hugh Dickins wrote:
> > > When KSM merges an mlocked page, it has been forgetting to munlock it:
> > > that's been left to free_page_mlock(), which reports it in /proc/vmstat
> > > as unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
> > > whinges "Page flag mlocked set for process" in mmotm, whereas mainline
> > > is silently forgiving). Call munlock_vma_page() to fix that.
> > >
> > > Signed-off-by: Hugh Dickins <[email protected]>
> >
> > Acked-by: Mel Gorman <[email protected]>
>
> Rik & Mel, thanks for the Acks.
>
> But please clarify: that patch was for mmotm and hopefully 2.6.33,
> but the vmstat issue (minus warning message) is there in 2.6.32-rc.
> Should I
>
> (a) forget it for 2.6.32
> (b) rush Linus a patch for 2.6.32 final
> (c) send a patch for 2.6.32.stable later on
>
> ? I just don't have a feel for how important this is.
>
My ack was based on the view that pages should not be getting to the buddy
allocator with the mlocked bit set. It only warns in -mm because it's meant
to be harmless-if-incorrect in all cases. Based on my reading of your
patch, it looked like a reasonable way of clearing the mlocked bit that
deals with the same type of isolation races typically faced by reclaim.
I felt it would be a case that either the isolation failed and the page would
end up back on the LRU list, or it would remain on whatever unevictable
LRU list it previously existed on, where it would be found later.
> Typically, these pages are immediately freed, and the only issue is
> which stats they get added to; but if fork has copied them into other
> mms, then such pages might stay unevictable indefinitely, despite no
> longer being in any mlocked vma.
>
> There's a remark in munlock_vma_page(), apropos a different issue,
> /*
> * We lost the race. let try_to_unmap() deal
> * with it. At least we get the page state and
> * mlock stats right. However, page is still on
> * the noreclaim list. We'll fix that up when
> * the page is eventually freed or we scan the
> * noreclaim list.
> */
> which implies that sometimes we scan the unevictable list and resolve
> such cases. But I wonder if that's nowadays the case?
>
My understanding was that if it failed to isolate then another process had
already done the necessary work and dropped the reference. The page would
then get properly freed at the last put_page. I did not double check this
assumption.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On 12/01/2009 04:59 AM, Andrea Arcangeli wrote:
> On Tue, Dec 01, 2009 at 06:46:06PM +0900, KOSAKI Motohiro wrote:
>
>> Ah, well, please wait a bit. I'm reviewing Larry's patch. I don't
>> dislike your idea; my last mail only pointed out an implementation issue.
>>
> Yep thanks for pointing it out. It's an implementation thing I don't
> like. The VM should not ever touch ptes when there's light VM pressure
> and plenty of unmapped clean cache available, but I'm ok if others
> disagree and want to keep it that way.
>
The VM needs to touch a few (but only a few) PTEs in
that situation, to make sure that anonymous pages get
moved to the inactive anon list and get to a real chance
at being referenced before we try to evict anonymous
pages.
Without a small amount of pre-aging, we would end up
essentially doing FIFO replacement of anonymous memory,
which has been known to be disastrous to performance
for over 40 years now.
A two-handed clock mechanism needs to put some distance
between the front and the back hands of the clock.
Having said that - it may be beneficial to keep very heavily
shared pages on the active list, without ever trying to scan
the ptes associated with them.
On Wed, Dec 02, 2009 at 12:08:18AM -0500, Rik van Riel wrote:
> The VM needs to touch a few (but only a few) PTEs in
> that situation, to make sure that anonymous pages get
> moved to the inactive anon list and get to a real chance
> at being referenced before we try to evict anonymous
> pages.
>
> Without a small amount of pre-aging, we would end up
> essentially doing FIFO replacement of anonymous memory,
> which has been known to be disastrous to performance
> for over 40 years now.
So far the only kernel that hangs in fork is the newer one...
In general I cannot care less about FIFO, I care about no CPU waste on
100% of my systems where swap is not needed. All my unmapped cache is
100% garbage collectable, and there is never any reason to flush any
tlb and walk the rmap chain. Give me a knob to disable the CPU waste
given I know what is going on, on my systems. I am totally ok with
slightly slower swap performance and fifo replacement in case I
eventually hit swap for a little while, then over time if memory
pressure stays high swap behavior will improve regardless of
flooding ipis to clear young bit when there are hundred gigabytes of
freeable cache unmapped and clean.
> Having said that - it may be beneficial to keep very heavily
> shared pages on the active list, without ever trying to scan
> the ptes associated with them.
Just mapped pages in general, not heavily... The other thing that is
beneficial likely is to stop page_referenced after 64 young bit clear,
that is referenced enough, you can enable this under my knob so that
it won't screw your algorithm. I don't have 1 terabyte of memory, so
you don't have to worry for me, I just want every cycle out of my cpu
without having to use O_DIRECT all the time.
> On Wed, Dec 02, 2009 at 12:08:18AM -0500, Rik van Riel wrote:
> > The VM needs to touch a few (but only a few) PTEs in
> > that situation, to make sure that anonymous pages get
> > moved to the inactive anon list and get to a real chance
> > at being referenced before we try to evict anonymous
> > pages.
> >
> > Without a small amount of pre-aging, we would end up
> > essentially doing FIFO replacement of anonymous memory,
> > which has been known to be disastrous to performance
> > for over 40 years now.
>
> So far the only kernel that hangs in fork is the newer one...
>
> In general I cannot care less about FIFO, I care about no CPU waste on
> 100% of my systems where swap is not needed. All my unmapped cache is
> 100% garbage collectable, and there is never any reason to flush any
> tlb and walk the rmap chain. Give me a knob to disable the CPU waste
> given I know what is going on, on my systems. I am totally ok with
> slightly slower swap performance and fifo replacement in case I
> eventually hit swap for a little while, then over time if memory
> pressure stays high swap behavior will improve regardless of
> flooding ipis to clear young bit when there are hundred gigabytes of
> freeable cache unmapped and clean.
>
> > Having said that - it may be beneficial to keep very heavily
> > shared pages on the active list, without ever trying to scan
> > the ptes associated with them.
>
> Just mapped pages in general, not heavily... The other thing that is
> beneficial likely is to stop page_referenced after 64 young bit clear,
> that is referenced enough, you can enable this under my knob so that
> it won't screw your algorithm. I don't have 1 terabyte of memory, so
> you don't have to worry for me, I just want every cycle out of my cpu
> without having to use O_DIRECT all the time.
Umm?? Personally I don't like a knob. If you have a problematic workload,
please tell us about it. I will try to reproduce the environment on my box.
If the current code doesn't work on KVM or something else, I really want
to fix it.
I think Larry's trylock idea and your 64-young-bit idea can be combined.
I only oppose moving the page to the inactive list without clearing the young bits. IOW,
if VM pressure is very low and the page has lots of young bits set, the page should
go back to the active list even though trylock(ptelock) isn't contended.
But unfortunately I don't have the problem workload you mentioned. Anyway
we need a way to evaluate your idea. We obviously need more info.
> Umm?? Personally I don't like a knob. If you have a problematic workload,
> please tell us about it. I will try to reproduce the environment on my box.
> If the current code doesn't work on KVM or something else, I really want
> to fix it.
>
> I think Larry's trylock idea and your 64-young-bit idea can be combined.
> I only oppose moving the page to the inactive list without clearing the young bits. IOW,
> if VM pressure is very low and the page has lots of young bits set, the page should
> go back to the active list even though trylock(ptelock) isn't contended.
>
> But unfortunately I don't have the problem workload you mentioned. Anyway
> we need a way to evaluate your idea. We obviously need more info.
[Off topic start]
The Windows kernel has a zero page thread which clears the pages on the free list
periodically, because many Windows subsystems prefer zero-filled pages.
Then, if we use a Windows guest, the zero-filled page will have a far higher mapcount
than other typically shared pages, I guess.
So, can we mark the zero-filled ksm page as unevictable?
On Fri, 4 Dec 2009 14:06:07 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:
> > Umm?? Personally I don't like a knob. If you have a problematic workload,
> > please tell us about it. I will try to reproduce the environment on my box.
> > If the current code doesn't work on KVM or something else, I really want
> > to fix it.
> >
> > I think Larry's trylock idea and your 64-young-bit idea can be combined.
> > I only oppose moving the page to the inactive list without clearing the young bits. IOW,
> > if VM pressure is very low and the page has lots of young bits set, the page should
> > go back to the active list even though trylock(ptelock) isn't contended.
> >
> > But unfortunately I don't have the problem workload you mentioned. Anyway
> > we need a way to evaluate your idea. We obviously need more info.
>
> [Off topic start]
>
> The Windows kernel has a zero page thread which clears the pages on the free list
> periodically, because many Windows subsystems prefer zero-filled pages.
> Then, if we use a Windows guest, the zero-filled page will have a far higher mapcount
> than other typically shared pages, I guess.
>
> So, can we mark the zero-filled ksm page as unevictable?
>
Hmm, can't we use ZERO_PAGE we have now ?
If do so,
- no mapcount check
- never on LRU
- don't have to maintain shared information because ZERO_PAGE itself has
copy-on-write nature.
Thanks,
-Kame
On Fri, Dec 04, 2009 at 02:06:07PM +0900, KOSAKI Motohiro wrote:
> The Windows kernel has a zero page thread which clears the pages on the free list
> periodically, because many Windows subsystems prefer zero-filled pages.
> Then, if we use a Windows guest, the zero-filled page will have a far higher mapcount
> than other typically shared pages, I guess.
>
> So, can we mark the zero-filled ksm page as unevictable?
I don't like magic for the zero ksm page, or a magic number after which we
consider a page unevictable.
Just breaking the loop after 64 young bits are cleared and putting the page back
to the head of the active list is enough. Clearly it requires a bit
more changes to fit into current code that uses page_referenced to
clear all young bits ignoring if they were set during the clear loop.
I think it's fishy to ignore the page_referenced retval and I don't
like the wipe_page_referenced concept. page_referenced should only be
called when we're in presence of VM pressure that requires
unmapping. And we should always re-add the page to active list head,
if it was found referenced as retval of page_referenced. I cannot care
less about first swapout burst to be FIFO because it'll be close to
FIFO anyway. The wipe_page_referenced thing was called 1 year ago
shortly after the page was allocated, then app touches the page after
it's in inactive anon, and then the app never touches the page again
for a year. And yet we consider it active a year after we cleared
its referenced bit. It's all very fishy... Plus that VM_EXEC check is still
there. The only magic allowed that I advocate is to have a
page_mapcount() check to differentiate between pure cache pollution
(i.e. to avoid being forced to O_DIRECT without actually activating
unnecessary VM activity on mapped pages that aren't pure cache
pollution by somebody running a backup with tar).
On Fri, Dec 04, 2009 at 02:16:17PM +0900, KAMEZAWA Hiroyuki wrote:
> Hmm, can't we use ZERO_PAGE we have now ?
> If do so,
> - no mapcount check
> - never on LRU
> - don't have to maintain shared information because ZERO_PAGE itself has
> copy-on-write nature.
The zero page could be added to the stable tree always to avoid a
memcmp and we could try to merge anon pages into it, instead of
merging it into ksmpages, but it's not a ksm page so it would require
special handling with branches. We considered doing a magic on
zeropage but we though it's not worth it. We need CPU to be efficient
on very shared pages not just zero page without magics, and the memory
saving is just 4k system-wide (all zero pages of all windows are
already shared).
On 12/04/2009 09:45 AM, Andrea Arcangeli wrote:
> I think it's fishy to ignore the page_referenced retval and I don't
> like the wipe_page_referenced concept. page_referenced should only be
> called when we're in presence of VM pressure that requires
> unmapping. And we should always re-add the page to active list head,
> if it was found referenced as retval of page_referenced.
You are wrong here, for scalability reasons I explained
to you half a dozen times before :)
I agree with the rest of your email, though.
--
All rights reversed.
* KAMEZAWA Hiroyuki ([email protected]) wrote:
> KOSAKI Motohiro <[email protected]> wrote:
> > The Windows kernel has a zero page thread which clears the pages on the free list
> > periodically, because many Windows subsystems prefer zero-filled pages.
> > Then, if we use a Windows guest, the zero-filled page will have a far higher mapcount
> > than other typically shared pages, I guess.
> >
> > So, can we mark the zero-filled ksm page as unevictable?
That's why I mentioned the page of zeroes as the prime example of
something with a high mapcount that shouldn't really ever be evicted.
> Hmm, can't we use ZERO_PAGE we have now ?
> If do so,
> - no mapcount check
> - never on LRU
> - don't have to maintain shared information because ZERO_PAGE itself has
> copy-on-write nature.
It's a somewhat special case, but wouldn't it be useful to have a generic
method to recognize this kind of sharing since it's a generic issue?
thanks,
-chris
On Fri, Dec 04, 2009 at 09:16:40AM -0800, Chris Wright wrote:
> That's why I mentioned the page of zeroes as the prime example of
> something with a high mapcount that shouldn't really ever be evicted.
Just a nitpick, "never" is too much, it should remain evictable if
somebody halts all VMs from the monitor and starts a workload that fills
RAM and runs for a very prolonged time pushing all VM into swap. This
is especially true if we stick to the below approach and it isn't
just 1 page in high-sharing.
> It's a somewhat special case, but wouldn't it be useful to have a generic
> method to recognize this kind of sharing since it's a generic issue?
Agreed.
* Andrea Arcangeli ([email protected]) wrote:
> On Fri, Dec 04, 2009 at 09:16:40AM -0800, Chris Wright wrote:
> > That's why I mentioned the page of zeroes as the prime example of
> > something with a high mapcount that shouldn't really ever be evicted.
>
> Just a nitpick, "never" is too much, it should remain evictable if
> somebody halts all VMs from the monitor and starts a workload that fills
> RAM and runs for a very prolonged time pushing all VM into swap. This
> is especially true if we stick to the below approach and it isn't
> just 1 page in high-sharing.
Yup, I completely agree, that's what I was trying to convey by
"shouldn't really ever" ;-)
thanks,
-chris
On Fri, 4 Dec 2009 09:16:40 -0800
Chris Wright <[email protected]> wrote:
> * KAMEZAWA Hiroyuki ([email protected]) wrote:
> > KOSAKI Motohiro <[email protected]> wrote:
> > > The Windows kernel has a zero page thread which clears the pages on the free list
> > > periodically, because many Windows subsystems prefer zero-filled pages.
> > > Then, if we use a Windows guest, the zero-filled page will have a far higher mapcount
> > > than other typically shared pages, I guess.
> > >
> > > So, can we mark the zero-filled ksm page as unevictable?
>
> That's why I mentioned the page of zeroes as the prime example of
> something with a high mapcount that shouldn't really ever be evicted.
>
> > Hmm, can't we use ZERO_PAGE we have now ?
> > If do so,
> > - no mapcount check
> > - never on LRU
> > - don't have to maintain shared information because ZERO_PAGE itself has
> > copy-on-write nature.
>
> It's a somewhat special case, but wouldn't it be useful to have a generic
> method to recognize this kind of sharing since it's a generic issue?
>
I just remembered why ZERO_PAGE was removed (in the past). It was because of
cache-line ping-pong at fork because of page->mapcount. And KSM introduces
zero-pages which have a mapcount again. If there are no problems in realistic usage of
KVM, ignore me.
Thanks,
-Kame
* KAMEZAWA Hiroyuki ([email protected]) wrote:
> On Fri, 4 Dec 2009 09:16:40 -0800
> Chris Wright <[email protected]> wrote:
> > * KAMEZAWA Hiroyuki ([email protected]) wrote:
> > > Hmm, can't we use ZERO_PAGE we have now ?
> > > If do so,
> > > - no mapcount check
> > > - never on LRU
> > > - don't have to maintain shared information because ZERO_PAGE itself has
> > > copy-on-write nature.
> >
> > It's a somewhat special case, but wouldn't it be useful to have a generic
> > method to recognize this kind of sharing since it's a generic issue?
>
> I just remembered why ZERO_PAGE was removed (in the past). It was because of
> cache-line ping-pong at fork because of page->mapcount. And KSM introduces
> zero-pages which have a mapcount again. If there are no problems in realistic usage of
> KVM, ignore me.
KVM is not exactly fork heavy (although it's not the only possible user
of KSM). And the CoW path has fault + copy already.
Semi-related...it can make good sense to make the KSM trees per NUMA
node. Would mean things like page of zeroes would collapse to number
of NUMA nodes pages rather than a single page, but has the benefit of
not adding remote access (although, probably more useful for text pages
than zero pages).
thanks,
-chris
On Wed, Dec 09, 2009 at 09:43:31AM +0900, KAMEZAWA Hiroyuki wrote:
> cache-line ping-pong at fork because of page->mapcount. And KSM introduces
> zero-pages which have a mapcount again. If there are no problems in realistic usage of
> KVM, ignore me.
The whole memory marked MADV_MERGEABLE by KVM is also marked
MADV_DONTFORK, so if KVM was to fork (and if it did, if it wasn't for
MADV_DONTFORK, it would also trigger all O_DIRECT vs fork race
conditions too, as KVM is one of the many apps that uses threads and
O_DIRECT - we try not to fork though but we sure did in the past), no
slowdown could ever happen in mapcount because of KSM: all KSM pages
aren't visible to the child.
It's still something to keep in mind for other KSM users, but I don't
think mapcount is big deal if compared to the risk of triggering COWs
later on those pages, in general KSM is all about saving tons of
memory at the expense of some CPU cycles (ksmd, cows, mapcount with
parallel forks etc...).
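Concretely, a KVM-like user sets both flags on its guest RAM. A sketch of that
setup (illustrative only, not the actual qemu code; assumes the usual
<sys/mman.h> definitions of MADV_MERGEABLE and MADV_DONTFORK):

#include <stdio.h>
#include <sys/mman.h>

static void *alloc_guest_ram(size_t size)
{
	void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED)
		return NULL;
	/* let ksmd merge identical guest pages */
	if (madvise(ram, size, MADV_MERGEABLE))
		perror("MADV_MERGEABLE");	/* e.g. CONFIG_KSM=n */
	/* keep the whole region invisible to any child after fork() */
	if (madvise(ram, size, MADV_DONTFORK))
		perror("MADV_DONTFORK");
	return ram;
}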
On Wed, 9 Dec 2009 17:12:19 +0100
Andrea Arcangeli <[email protected]> wrote:
> On Wed, Dec 09, 2009 at 09:43:31AM +0900, KAMEZAWA Hiroyuki wrote:
> > cache-line ping-pong at fork beacause of page->mapcount. And KSM introduces
> > zero-pages which have mapcount again. If no problems in realitsitc usage of
> > KVM, ignore me.
>
> The whole memory marked MADV_MERGEABLE by KVM is also marked
> MADV_DONTFORK, so if KVM was to fork (and if it did, if it wasn't for
> MADV_DONTFORK, it would also trigger all O_DIRECT vs fork race
> conditions too, as KVM is one of the many apps that uses threads and
> O_DIRECT - we try not to fork though but we sure did in the past), no
> slowdown could ever happen in mapcount because of KSM: all KSM pages
> aren't visible to the child.
>
> It's still something to keep in mind for other KSM users, but I don't
> think mapcount is big deal if compared to the risk of triggering COWs
> later on those pages, in general KSM is all about saving tons of
> memory at the expense of some CPU cycles (ksmd, cows, mapcount with
> parallel forks etc...).
>
Okay, thank you for the kind explanation,
and sorry for the noise.
Thanks,
-Kame