Here's a series of six miscellaneous mm patches against 2.6.32-rc5-mm1,
intended to follow my earlier swap_info patches, or slot in just before
mmend in the mmotm series.
Apart from the sixth, they have some relevance to the KSM-page swapping
patches, which will follow in a few days. They clear away some mm cruft,
to let that series concentrate on ksm.c; but they should stand on their own.
include/linux/ksm.h | 5 -
include/linux/mm.h | 17 +++
include/linux/page-flags.h | 8 -
include/linux/rmap.h | 8 +
mm/Kconfig | 14 --
mm/internal.h | 26 ++---
mm/memory-failure.c | 2
mm/memory.c | 4
mm/migrate.c | 11 --
mm/mlock.c | 2
mm/page_alloc.c | 4
mm/rmap.c | 174 +++++++++++------------------------
mm/swapfile.c | 2
13 files changed, 110 insertions(+), 167 deletions(-)
Hugh
At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit
set in page->mapping, with the higher bits a pointer to the anon_vma;
and have defined PageKsm(page) as that with NULL anon_vma.
But KSM swapping will need to store a pointer there: so in preparation
for that, now define PAGE_MAPPING_FLAGS as the low two bits, including
PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
other use for the bit emerges).
Declare page_rmapping(page) to return the pointer part of page->mapping,
and page_anon_vma(page) to return the anon_vma pointer when that's what
it is. Use these in a few appropriate places: notably, unuse_vma() has
been testing page->mapping, but is better off testing page_anon_vma()
(cases may be added in which flag bits are set without any pointer).
Signed-off-by: Hugh Dickins <[email protected]>
---
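Not part of the patch, but for anyone who wants to see the encoding in
isolation, here is a trivial standalone userspace sketch of it: the helpers
just mirror page_rmapping() and page_anon_vma() above, with a plain
unsigned long standing in for page->mapping (names and addresses here are
made up for illustration only).

#include <stdio.h>

#define PAGE_MAPPING_ANON	1UL
#define PAGE_MAPPING_KSM	2UL
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)

/* mirrors page_rmapping(): strip the flag bits from page->mapping */
static void *rmapping(unsigned long mapping)
{
	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
}

/* mirrors page_anon_vma(): only a plain anon page has an anon_vma */
static void *anon_vma_of(unsigned long mapping)
{
	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;
	return rmapping(mapping);
}

int main(void)
{
	unsigned long file = 0x1000;				/* address_space */
	unsigned long anon = 0x2000 | PAGE_MAPPING_ANON;	/* anon_vma */
	unsigned long ksm  = 0x3000 | PAGE_MAPPING_ANON | PAGE_MAPPING_KSM;

	printf("file: rmapping %p anon_vma %p\n", rmapping(file), anon_vma_of(file));
	printf("anon: rmapping %p anon_vma %p\n", rmapping(anon), anon_vma_of(anon));
	printf("ksm:  rmapping %p anon_vma %p\n", rmapping(ksm), anon_vma_of(ksm));
	return 0;
}

Running that shows an anon_vma coming back only for the plain anon case,
with both the file page and the KSM page giving NULL.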
include/linux/ksm.h | 5 +++--
include/linux/mm.h | 17 ++++++++++++++++-
include/linux/rmap.h | 8 ++++++++
mm/migrate.c | 11 ++++-------
mm/rmap.c | 7 +++----
mm/swapfile.c | 2 +-
6 files changed, 35 insertions(+), 15 deletions(-)
--- mm0/include/linux/ksm.h 2009-09-28 00:28:38.000000000 +0100
+++ mm1/include/linux/ksm.h 2009-11-04 10:52:45.000000000 +0000
@@ -38,7 +38,8 @@ static inline void ksm_exit(struct mm_st
*/
static inline int PageKsm(struct page *page)
{
- return ((unsigned long)page->mapping == PAGE_MAPPING_ANON);
+ return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
+ (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
}
/*
@@ -47,7 +48,7 @@ static inline int PageKsm(struct page *p
static inline void page_add_ksm_rmap(struct page *page)
{
if (atomic_inc_and_test(&page->_mapcount)) {
- page->mapping = (void *) PAGE_MAPPING_ANON;
+ page->mapping = (void *) (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
__inc_zone_page_state(page, NR_ANON_PAGES);
}
}
--- mm0/include/linux/mm.h 2009-11-02 12:32:34.000000000 +0000
+++ mm1/include/linux/mm.h 2009-11-04 10:52:45.000000000 +0000
@@ -620,13 +620,22 @@ void page_address_init(void);
/*
* On an anonymous page mapped into a user virtual memory area,
* page->mapping points to its anon_vma, not to a struct address_space;
- * with the PAGE_MAPPING_ANON bit set to distinguish it.
+ * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h.
+ *
+ * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled,
+ * the PAGE_MAPPING_KSM bit may be set along with the PAGE_MAPPING_ANON bit;
+ * and then page->mapping points, not to an anon_vma, but to a private
+ * structure which KSM associates with that merged page. See ksm.h.
+ *
+ * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.
*
* Please note that, confusingly, "page_mapping" refers to the inode
* address_space which maps the page from disk; whereas "page_mapped"
* refers to user virtual address space into which the page is mapped.
*/
#define PAGE_MAPPING_ANON 1
+#define PAGE_MAPPING_KSM 2
+#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
extern struct address_space swapper_space;
static inline struct address_space *page_mapping(struct page *page)
@@ -644,6 +653,12 @@ static inline struct address_space *page
return mapping;
}
+/* Neutral page->mapping pointer to address_space or anon_vma or other */
+static inline void *page_rmapping(struct page *page)
+{
+ return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+}
+
static inline int PageAnon(struct page *page)
{
return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
--- mm0/include/linux/rmap.h 2009-09-28 00:28:39.000000000 +0100
+++ mm1/include/linux/rmap.h 2009-11-04 10:52:45.000000000 +0000
@@ -39,6 +39,14 @@ struct anon_vma {
#ifdef CONFIG_MMU
+static inline struct anon_vma *page_anon_vma(struct page *page)
+{
+ if (((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) !=
+ PAGE_MAPPING_ANON)
+ return NULL;
+ return page_rmapping(page);
+}
+
static inline void anon_vma_lock(struct vm_area_struct *vma)
{
struct anon_vma *anon_vma = vma->anon_vma;
--- mm0/mm/migrate.c 2009-11-02 12:32:34.000000000 +0000
+++ mm1/mm/migrate.c 2009-11-04 10:52:45.000000000 +0000
@@ -172,17 +172,14 @@ static void remove_anon_migration_ptes(s
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
- unsigned long mapping;
-
- mapping = (unsigned long)new->mapping;
-
- if (!mapping || (mapping & PAGE_MAPPING_ANON) == 0)
- return;
/*
* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
*/
- anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
+ anon_vma = page_anon_vma(new);
+ if (!anon_vma)
+ return;
+
spin_lock(&anon_vma->lock);
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
--- mm0/mm/rmap.c 2009-11-02 12:32:34.000000000 +0000
+++ mm1/mm/rmap.c 2009-11-04 10:52:45.000000000 +0000
@@ -203,7 +203,7 @@ struct anon_vma *page_lock_anon_vma(stru
rcu_read_lock();
anon_mapping = (unsigned long) page->mapping;
- if (!(anon_mapping & PAGE_MAPPING_ANON))
+ if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
goto out;
if (!page_mapped(page))
goto out;
@@ -248,8 +248,7 @@ vma_address(struct page *page, struct vm
unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
{
if (PageAnon(page)) {
- if ((void *)vma->anon_vma !=
- (void *)page->mapping - PAGE_MAPPING_ANON)
+ if (vma->anon_vma != page_anon_vma(page))
return -EFAULT;
} else if (page->mapping && !(vma->vm_flags & VM_NONLINEAR)) {
if (!vma->vm_file ||
@@ -512,7 +511,7 @@ int page_referenced(struct page *page,
referenced++;
*vm_flags = 0;
- if (page_mapped(page) && page->mapping) {
+ if (page_mapped(page) && page_rmapping(page)) {
if (PageAnon(page))
referenced += page_referenced_anon(page, mem_cont,
vm_flags);
--- mm0/mm/swapfile.c 2009-11-04 10:21:17.000000000 +0000
+++ mm1/mm/swapfile.c 2009-11-04 10:52:45.000000000 +0000
@@ -937,7 +937,7 @@ static int unuse_vma(struct vm_area_stru
unsigned long addr, end, next;
int ret;
- if (page->mapping) {
+ if (page_anon_vma(page)) {
addr = page_address_in_vma(page, vma);
if (addr == -EFAULT)
return 0;
There's contorted mlock/munlock handling in try_to_unmap_anon() and
try_to_unmap_file(), which we'd prefer not to repeat for KSM swapping.
Simplify it by moving it all down into try_to_unmap_one().
One thing is then lost: try_to_munlock()'s distinction between when no
vma holds the page mlocked, and when a vma does mlock it, but we could
not get mmap_sem to set the page flag. But its only caller takes no
interest in that distinction (and is better off testing SWAP_MLOCK anyway),
so let's keep the code simple and return SWAP_AGAIN for both cases.
try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
amusing: once unravelled, it turns out to have been choosing between
two different ways of doing the same nothing. Ah, no, one way was
actually returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/mlock.c | 2
mm/rmap.c | 107 ++++++++++++---------------------------------------
2 files changed, 27 insertions(+), 82 deletions(-)
--- mm1/mm/mlock.c 2009-09-28 00:28:41.000000000 +0100
+++ mm2/mm/mlock.c 2009-11-04 10:52:52.000000000 +0000
@@ -117,7 +117,7 @@ static void munlock_vma_page(struct page
/*
* did try_to_unlock() succeed or punt?
*/
- if (ret == SWAP_SUCCESS || ret == SWAP_AGAIN)
+ if (ret != SWAP_MLOCK)
count_vm_event(UNEVICTABLE_PGMUNLOCKED);
putback_lru_page(page);
--- mm1/mm/rmap.c 2009-11-04 10:52:45.000000000 +0000
+++ mm2/mm/rmap.c 2009-11-04 10:52:52.000000000 +0000
@@ -787,6 +787,8 @@ static int try_to_unmap_one(struct page
ret = SWAP_MLOCK;
goto out_unmap;
}
+ if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
+ goto out_unmap;
}
if (!(flags & TTU_IGNORE_ACCESS)) {
if (ptep_clear_flush_young_notify(vma, address, pte)) {
@@ -852,12 +854,22 @@ static int try_to_unmap_one(struct page
} else
dec_mm_counter(mm, file_rss);
-
page_remove_rmap(page);
page_cache_release(page);
out_unmap:
pte_unmap_unlock(pte, ptl);
+
+ if (MLOCK_PAGES && ret == SWAP_MLOCK) {
+ ret = SWAP_AGAIN;
+ if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ if (vma->vm_flags & VM_LOCKED) {
+ mlock_vma_page(page);
+ ret = SWAP_MLOCK;
+ }
+ up_read(&vma->vm_mm->mmap_sem);
+ }
+ }
out:
return ret;
}
@@ -979,23 +991,6 @@ static int try_to_unmap_cluster(unsigned
return ret;
}
-/*
- * common handling for pages mapped in VM_LOCKED vmas
- */
-static int try_to_mlock_page(struct page *page, struct vm_area_struct *vma)
-{
- int mlocked = 0;
-
- if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
- if (vma->vm_flags & VM_LOCKED) {
- mlock_vma_page(page);
- mlocked++; /* really mlocked the page */
- }
- up_read(&vma->vm_mm->mmap_sem);
- }
- return mlocked;
-}
-
/**
* try_to_unmap_anon - unmap or unlock anonymous page using the object-based
* rmap method
@@ -1016,42 +1011,19 @@ static int try_to_unmap_anon(struct page
{
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
- unsigned int mlocked = 0;
int ret = SWAP_AGAIN;
- int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
-
- if (MLOCK_PAGES && unlikely(unlock))
- ret = SWAP_SUCCESS; /* default for try_to_munlock() */
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
return ret;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- if (MLOCK_PAGES && unlikely(unlock)) {
- if (!((vma->vm_flags & VM_LOCKED) &&
- page_mapped_in_vma(page, vma)))
- continue; /* must visit all unlocked vmas */
- ret = SWAP_MLOCK; /* saw at least one mlocked vma */
- } else {
- ret = try_to_unmap_one(page, vma, flags);
- if (ret == SWAP_FAIL || !page_mapped(page))
- break;
- }
- if (ret == SWAP_MLOCK) {
- mlocked = try_to_mlock_page(page, vma);
- if (mlocked)
- break; /* stop if actually mlocked page */
- }
+ ret = try_to_unmap_one(page, vma, flags);
+ if (ret != SWAP_AGAIN || !page_mapped(page))
+ break;
}
page_unlock_anon_vma(anon_vma);
-
- if (mlocked)
- ret = SWAP_MLOCK; /* actually mlocked the page */
- else if (ret == SWAP_MLOCK)
- ret = SWAP_AGAIN; /* saw VM_LOCKED vma */
-
return ret;
}
@@ -1081,45 +1053,23 @@ static int try_to_unmap_file(struct page
unsigned long max_nl_cursor = 0;
unsigned long max_nl_size = 0;
unsigned int mapcount;
- unsigned int mlocked = 0;
- int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
-
- if (MLOCK_PAGES && unlikely(unlock))
- ret = SWAP_SUCCESS; /* default for try_to_munlock() */
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
- if (MLOCK_PAGES && unlikely(unlock)) {
- if (!((vma->vm_flags & VM_LOCKED) &&
- page_mapped_in_vma(page, vma)))
- continue; /* must visit all vmas */
- ret = SWAP_MLOCK;
- } else {
- ret = try_to_unmap_one(page, vma, flags);
- if (ret == SWAP_FAIL || !page_mapped(page))
- goto out;
- }
- if (ret == SWAP_MLOCK) {
- mlocked = try_to_mlock_page(page, vma);
- if (mlocked)
- break; /* stop if actually mlocked page */
- }
+ ret = try_to_unmap_one(page, vma, flags);
+ if (ret != SWAP_AGAIN || !page_mapped(page))
+ goto out;
}
- if (mlocked)
+ if (list_empty(&mapping->i_mmap_nonlinear))
goto out;
- if (list_empty(&mapping->i_mmap_nonlinear))
+ /* We don't bother to try to find the munlocked page in nonlinears */
+ if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
goto out;
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (MLOCK_PAGES && unlikely(unlock)) {
- if (!(vma->vm_flags & VM_LOCKED))
- continue; /* must visit all vmas */
- ret = SWAP_MLOCK; /* leave mlocked == 0 */
- goto out; /* no need to look further */
- }
if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED))
continue;
@@ -1161,10 +1111,9 @@ static int try_to_unmap_file(struct page
cursor = (unsigned long) vma->vm_private_data;
while ( cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
- ret = try_to_unmap_cluster(cursor, &mapcount,
- vma, page);
- if (ret == SWAP_MLOCK)
- mlocked = 2; /* to return below */
+ if (try_to_unmap_cluster(cursor, &mapcount,
+ vma, page) == SWAP_MLOCK)
+ ret = SWAP_MLOCK;
cursor += CLUSTER_SIZE;
vma->vm_private_data = (void *) cursor;
if ((int)mapcount <= 0)
@@ -1185,10 +1134,6 @@ static int try_to_unmap_file(struct page
vma->vm_private_data = NULL;
out:
spin_unlock(&mapping->i_mmap_lock);
- if (mlocked)
- ret = SWAP_MLOCK; /* actually mlocked the page */
- else if (ret == SWAP_MLOCK)
- ret = SWAP_AGAIN; /* saw VM_LOCKED vma */
return ret;
}
@@ -1231,7 +1176,7 @@ int try_to_unmap(struct page *page, enum
*
* Return values are:
*
- * SWAP_SUCCESS - no vma's holding page mlocked.
+ * SWAP_AGAIN - no vma is holding page mlocked, or,
* SWAP_AGAIN - page mapped in mlocked vma -- couldn't acquire mmap sem
* SWAP_MLOCK - page is now mlocked.
*/
Remove three degrees of obfuscation, left over from when we had
CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT
is CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o)
are only built when CONFIG_MMU, so don't need such conditions at all.
Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK*
lines from 169 defconfigs: leave those to evolve in due course.
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/page-flags.h | 8 +++-----
mm/Kconfig | 8 --------
mm/internal.h | 26 ++++++++++++--------------
mm/memory-failure.c | 2 --
mm/page_alloc.c | 4 ----
mm/rmap.c | 17 +++++------------
6 files changed, 20 insertions(+), 45 deletions(-)
--- mm2/include/linux/page-flags.h 2009-11-02 12:32:34.000000000 +0000
+++ mm3/include/linux/page-flags.h 2009-11-04 10:52:58.000000000 +0000
@@ -99,7 +99,7 @@ enum pageflags {
PG_buddy, /* Page is free, on buddy lists */
PG_swapbacked, /* Page is backed by RAM/swap */
PG_unevictable, /* Page is "unevictable" */
-#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+#ifdef CONFIG_MMU
PG_mlocked, /* Page is vma mlocked */
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
@@ -259,12 +259,10 @@ PAGEFLAG_FALSE(SwapCache)
PAGEFLAG(Unevictable, unevictable) __CLEARPAGEFLAG(Unevictable, unevictable)
TESTCLEARFLAG(Unevictable, unevictable)
-#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
-#define MLOCK_PAGES 1
+#ifdef CONFIG_MMU
PAGEFLAG(Mlocked, mlocked) __CLEARPAGEFLAG(Mlocked, mlocked)
TESTSCFLAG(Mlocked, mlocked) __TESTCLEARFLAG(Mlocked, mlocked)
#else
-#define MLOCK_PAGES 0
PAGEFLAG_FALSE(Mlocked) SETPAGEFLAG_NOOP(Mlocked)
TESTCLEARFLAG_FALSE(Mlocked) __TESTCLEARFLAG_FALSE(Mlocked)
#endif
@@ -393,7 +391,7 @@ static inline void __ClearPageTail(struc
#endif /* !PAGEFLAGS_EXTENDED */
-#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+#ifdef CONFIG_MMU
#define __PG_MLOCKED (1 << PG_mlocked)
#else
#define __PG_MLOCKED 0
--- mm2/mm/Kconfig 2009-11-02 12:32:34.000000000 +0000
+++ mm3/mm/Kconfig 2009-11-04 10:52:58.000000000 +0000
@@ -203,14 +203,6 @@ config VIRT_TO_BUS
def_bool y
depends on !ARCH_NO_VIRT_TO_BUS
-config HAVE_MLOCK
- bool
- default y if MMU=y
-
-config HAVE_MLOCKED_PAGE_BIT
- bool
- default y if HAVE_MLOCK=y
-
config MMU_NOTIFIER
bool
--- mm2/mm/internal.h 2009-09-28 00:28:41.000000000 +0100
+++ mm3/mm/internal.h 2009-11-04 10:52:58.000000000 +0000
@@ -63,17 +63,6 @@ static inline unsigned long page_order(s
return page_private(page);
}
-#ifdef CONFIG_HAVE_MLOCK
-extern long mlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
-extern void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
-static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
-{
- munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
-}
-#endif
-
/*
* unevictable_migrate_page() called only from migrate_page_copy() to
* migrate unevictable flag to new page.
@@ -86,7 +75,16 @@ static inline void unevictable_migrate_p
SetPageUnevictable(new);
}
-#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+#ifdef CONFIG_MMU
+extern long mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
+extern void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
+static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
+{
+ munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+}
+
/*
* Called only in fault path via page_evictable() for a new page
* to determine if it's being mapped into a LOCKED vma.
@@ -144,7 +142,7 @@ static inline void mlock_migrate_page(st
}
}
-#else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
+#else /* !CONFIG_MMU */
static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
{
return 0;
@@ -153,7 +151,7 @@ static inline void clear_page_mlock(stru
static inline void mlock_vma_page(struct page *page) { }
static inline void mlock_migrate_page(struct page *new, struct page *old) { }
-#endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
+#endif /* !CONFIG_MMU */
/*
* Return the mem_map entry representing the 'offset' subpage within
--- mm2/mm/memory-failure.c 2009-11-02 12:32:34.000000000 +0000
+++ mm3/mm/memory-failure.c 2009-11-04 10:52:58.000000000 +0000
@@ -582,10 +582,8 @@ static struct page_state {
{ unevict|dirty, unevict|dirty, "unevictable LRU", me_pagecache_dirty},
{ unevict, unevict, "unevictable LRU", me_pagecache_clean},
-#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
{ mlock|dirty, mlock|dirty, "mlocked LRU", me_pagecache_dirty },
{ mlock, mlock, "mlocked LRU", me_pagecache_clean },
-#endif
{ lru|dirty, lru|dirty, "LRU", me_pagecache_dirty },
{ lru|dirty, lru, "clean LRU", me_pagecache_clean },
--- mm2/mm/page_alloc.c 2009-11-02 12:32:34.000000000 +0000
+++ mm3/mm/page_alloc.c 2009-11-04 10:52:58.000000000 +0000
@@ -487,7 +487,6 @@ static inline void __free_one_page(struc
zone->free_area[order].nr_free++;
}
-#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
/*
* free_page_mlock() -- clean up attempts to free and mlocked() page.
* Page should not be on lru, so no need to fix that up.
@@ -503,9 +502,6 @@ static inline void free_page_mlock(struc
__dec_zone_page_state(page, NR_MLOCK);
__count_vm_event(UNEVICTABLE_MLOCKFREED);
}
-#else
-static void free_page_mlock(struct page *page) { }
-#endif
static inline int free_pages_check(struct page *page)
{
--- mm2/mm/rmap.c 2009-11-04 10:52:52.000000000 +0000
+++ mm3/mm/rmap.c 2009-11-04 10:52:58.000000000 +0000
@@ -787,7 +787,7 @@ static int try_to_unmap_one(struct page
ret = SWAP_MLOCK;
goto out_unmap;
}
- if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
+ if (TTU_ACTION(flags) == TTU_MUNLOCK)
goto out_unmap;
}
if (!(flags & TTU_IGNORE_ACCESS)) {
@@ -860,7 +860,7 @@ static int try_to_unmap_one(struct page
out_unmap:
pte_unmap_unlock(pte, ptl);
- if (MLOCK_PAGES && ret == SWAP_MLOCK) {
+ if (ret == SWAP_MLOCK) {
ret = SWAP_AGAIN;
if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
if (vma->vm_flags & VM_LOCKED) {
@@ -937,11 +937,10 @@ static int try_to_unmap_cluster(unsigned
return ret;
/*
- * MLOCK_PAGES => feature is configured.
- * if we can acquire the mmap_sem for read, and vma is VM_LOCKED,
+ * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
* keep the sem while scanning the cluster for mlocking pages.
*/
- if (MLOCK_PAGES && down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
locked_vma = (vma->vm_flags & VM_LOCKED);
if (!locked_vma)
up_read(&vma->vm_mm->mmap_sem); /* don't need it */
@@ -1065,14 +1064,11 @@ static int try_to_unmap_file(struct page
goto out;
/* We don't bother to try to find the munlocked page in nonlinears */
- if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
+ if (TTU_ACTION(flags) == TTU_MUNLOCK)
goto out;
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
- (vma->vm_flags & VM_LOCKED))
- continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
max_nl_cursor = cursor;
@@ -1105,9 +1101,6 @@ static int try_to_unmap_file(struct page
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
- (vma->vm_flags & VM_LOCKED))
- continue;
cursor = (unsigned long) vma->vm_private_data;
while ( cursor < max_nl_cursor &&
cursor < vma->vm_end - vma->vm_start) {
KSM swapping will know where page_referenced_one() and try_to_unmap_one()
should look. It could hack page->index to get them to do what it wants,
but it seems cleaner now to pass the address down to them.
Make the same change to page_mkclean_one(), since it follows the same
pattern; but there's no real need in its case.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/rmap.c | 53 ++++++++++++++++++++++++++--------------------------
1 file changed, 27 insertions(+), 26 deletions(-)
--- mm3/mm/rmap.c 2009-11-04 10:52:58.000000000 +0000
+++ mm4/mm/rmap.c 2009-11-04 10:53:05.000000000 +0000
@@ -336,21 +336,15 @@ int page_mapped_in_vma(struct page *page
* Subfunctions of page_referenced: page_referenced_one called
* repeatedly from either page_referenced_anon or page_referenced_file.
*/
-static int page_referenced_one(struct page *page,
- struct vm_area_struct *vma,
- unsigned int *mapcount,
+static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address, unsigned int *mapcount,
unsigned long *vm_flags)
{
struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
pte_t *pte;
spinlock_t *ptl;
int referenced = 0;
- address = vma_address(page, vma);
- if (address == -EFAULT)
- goto out;
-
pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
goto out;
@@ -408,6 +402,9 @@ static int page_referenced_anon(struct p
mapcount = page_mapcount(page);
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+ unsigned long address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
/*
* If we are reclaiming on behalf of a cgroup, skip
* counting on behalf of references from different
@@ -415,7 +412,7 @@ static int page_referenced_anon(struct p
*/
if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
continue;
- referenced += page_referenced_one(page, vma,
+ referenced += page_referenced_one(page, vma, address,
&mapcount, vm_flags);
if (!mapcount)
break;
@@ -473,6 +470,9 @@ static int page_referenced_file(struct p
mapcount = page_mapcount(page);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
/*
* If we are reclaiming on behalf of a cgroup, skip
* counting on behalf of references from different
@@ -480,7 +480,7 @@ static int page_referenced_file(struct p
*/
if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
continue;
- referenced += page_referenced_one(page, vma,
+ referenced += page_referenced_one(page, vma, address,
&mapcount, vm_flags);
if (!mapcount)
break;
@@ -534,18 +534,14 @@ int page_referenced(struct page *page,
return referenced;
}
-static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
pte_t *pte;
spinlock_t *ptl;
int ret = 0;
- address = vma_address(page, vma);
- if (address == -EFAULT)
- goto out;
-
pte = page_check_address(page, mm, address, &ptl, 1);
if (!pte)
goto out;
@@ -577,8 +573,12 @@ static int page_mkclean_file(struct addr
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
- if (vma->vm_flags & VM_SHARED)
- ret += page_mkclean_one(page, vma);
+ if (vma->vm_flags & VM_SHARED) {
+ unsigned long address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
+ ret += page_mkclean_one(page, vma, address);
+ }
}
spin_unlock(&mapping->i_mmap_lock);
return ret;
@@ -760,19 +760,14 @@ void page_remove_rmap(struct page *page)
* repeatedly from either try_to_unmap_anon or try_to_unmap_file.
*/
static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
- enum ttu_flags flags)
+ unsigned long address, enum ttu_flags flags)
{
struct mm_struct *mm = vma->vm_mm;
- unsigned long address;
pte_t *pte;
pte_t pteval;
spinlock_t *ptl;
int ret = SWAP_AGAIN;
- address = vma_address(page, vma);
- if (address == -EFAULT)
- goto out;
-
pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
goto out;
@@ -1017,7 +1012,10 @@ static int try_to_unmap_anon(struct page
return ret;
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- ret = try_to_unmap_one(page, vma, flags);
+ unsigned long address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
+ ret = try_to_unmap_one(page, vma, address, flags);
if (ret != SWAP_AGAIN || !page_mapped(page))
break;
}
@@ -1055,7 +1053,10 @@ static int try_to_unmap_file(struct page
spin_lock(&mapping->i_mmap_lock);
vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
- ret = try_to_unmap_one(page, vma, flags);
+ unsigned long address = vma_address(page, vma);
+ if (address == -EFAULT)
+ continue;
+ ret = try_to_unmap_one(page, vma, address, flags);
if (ret != SWAP_AGAIN || !page_mapped(page))
goto out;
}
CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t,
and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep
enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.
When 2.6.15 placed the split page table lock inside struct page (usually
sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and
we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4
by letting the spinlock_t occupy both page->private and page->mapping).
Should these debugging options be allowed to double the size of a struct
page, when only one minority use of the page (as a page table) needs to
fit a spinlock in there? Perhaps not.
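To put rough numbers on that (from the figures above; exact sizes vary by
config): on 64-bit a plain spinlock_t is 4 bytes, DEBUG_SPINLOCK takes it
to about 20, DEBUG_LOCK_ALLOC to about 44, and LOCK_STAT to about 60,
against the 56 bytes of the whole struct page it sits in. The 32-bit
progression is roughly 4, 16, 28, 36 against a 32-byte struct page.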
Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK
or DEBUG_LOCK_ALLOC is in force. I've sometimes tried to be cleverer,
kmallocing a cacheline for the spinlock when it doesn't fit, but given
up each time. Falling back to mm->page_table_lock (as we do when ptlock
is not split) lets lockdep check out the strictest path anyway.
And now that some arches allow 8192 cpus, use 999999 for infinity.
(What has this got to do with KSM swapping? It doesn't care about the
size of struct page, but may care about random junk in page->mapping -
to be explained separately later.)
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/Kconfig | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- mm4/mm/Kconfig 2009-11-04 10:52:58.000000000 +0000
+++ mm5/mm/Kconfig 2009-11-04 10:53:13.000000000 +0000
@@ -161,11 +161,13 @@ config PAGEFLAGS_EXTENDED
# Default to 4 for wider testing, though 8 might be more appropriate.
# ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock.
# PA-RISC 7xxx's spinlock_t would enlarge struct page from 32 to 44 bytes.
+# DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC spinlock_t also enlarge struct page.
#
config SPLIT_PTLOCK_CPUS
int
- default "4096" if ARM && !CPU_CACHE_VIPT
- default "4096" if PARISC && !PA20
+ default "999999" if ARM && !CPU_CACHE_VIPT
+ default "999999" if PARISC && !PA20
+ default "999999" if DEBUG_SPINLOCK || DEBUG_LOCK_ALLOC
default "4"
#
When do_nonlinear_fault() realizes that the page table must have been
corrupted for it to have been called, it does print_bad_pte() and
returns ... VM_FAULT_OOM, which is hard to understand.
It made some sense when I did it for 2.6.15, when do_page_fault()
just killed the current process; but nowadays it lets the OOM killer
decide who to kill - so page table corruption in one process would
be liable to kill another.
Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee
that the process will be killed, but is good enough for such a rare
abnormality, accompanied as it is by the "BUG: Bad page map" message.
And recent HWPOISON work has copied that code into do_swap_page(),
when it finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
Signed-off-by: Hugh Dickins <[email protected]>
---
This one has nothing whatever to do with KSM swapping,
just something that KAMEZAWA-san and Minchan noticed recently.
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- mm5/mm/memory.c 2009-11-02 12:32:34.000000000 +0000
+++ mm6/mm/memory.c 2009-11-07 14:44:58.000000000 +0000
@@ -2529,7 +2529,7 @@ static int do_swap_page(struct mm_struct
ret = VM_FAULT_HWPOISON;
} else {
print_bad_pte(vma, address, orig_pte, NULL);
- ret = VM_FAULT_OOM;
+ ret = VM_FAULT_SIGBUS;
}
goto out;
}
@@ -2925,7 +2925,7 @@ static int do_nonlinear_fault(struct mm_
* Page table corrupted: show pte and kill process.
*/
print_bad_pte(vma, address, orig_pte, NULL);
- return VM_FAULT_OOM;
+ return VM_FAULT_SIGBUS;
}
pgoff = pte_to_pgoff(orig_pte);
On Tue, 2009-11-10 at 22:02 +0000, Hugh Dickins wrote:
> CONFIG_DEBUG_SPINLOCK adds 12 or 16 bytes to a 32- or 64-bit spinlock_t,
> and CONFIG_DEBUG_LOCK_ALLOC adds another 12 or 24 bytes to it: lockdep
> enables both of those, and CONFIG_LOCK_STAT adds 8 or 16 bytes to that.
>
> When 2.6.15 placed the split page table lock inside struct page (usually
> sized 32 or 56 bytes), only CONFIG_DEBUG_SPINLOCK was a possibility, and
> we ignored the enlargement (but fitted in CONFIG_GENERIC_LOCKBREAK's 4
> by letting the spinlock_t occupy both page->private and page->mapping).
>
> Should these debugging options be allowed to double the size of a struct
> page, when only one minority use of the page (as a page table) needs to
> fit a spinlock in there? Perhaps not.
>
> Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK
> or DEBUG_LOCK_ALLOC is in force. I've sometimes tried to be cleverer,
> kmallocing a cacheline for the spinlock when it doesn't fit, but given
> up each time. Falling back to mm->page_table_lock (as we do when ptlock
> is not split) lets lockdep check out the strictest path anyway.
Why? we know lockdep bloats stuff we never cared.. and hiding a popular
CONFIG option from lockdep doesn't seem like a good idea to me.
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/Kconfig | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> --- mm4/mm/Kconfig 2009-11-04 10:52:58.000000000 +0000
> +++ mm5/mm/Kconfig 2009-11-04 10:53:13.000000000 +0000
> @@ -161,11 +161,13 @@ config PAGEFLAGS_EXTENDED
> # Default to 4 for wider testing, though 8 might be more appropriate.
> # ARM's adjust_pte (unused if VIPT) depends on mm-wide page_table_lock.
> # PA-RISC 7xxx's spinlock_t would enlarge struct page from 32 to 44 bytes.
> +# DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC spinlock_t also enlarge struct page.
> #
> config SPLIT_PTLOCK_CPUS
> int
> - default "4096" if ARM && !CPU_CACHE_VIPT
> - default "4096" if PARISC && !PA20
> + default "999999" if ARM && !CPU_CACHE_VIPT
> + default "999999" if PARISC && !PA20
> + default "999999" if DEBUG_SPINLOCK || DEBUG_LOCK_ALLOC
> default "4"
>
> #
fwiw, in -rt we carry this, because there spinlock_t is huge even
without lockdep.
---
commit 27909c87933670deead6ab74274cf61ebffad5ac
Author: Peter Zijlstra <[email protected]>
Date: Fri Jul 3 08:44:54 2009 -0500
mm: shrink the page frame to !-rt size
The below is a boot-tested hack to shrink the page frame size back to
normal.
Should be a net win since there should be many fewer PTE-pages than
page-frames.
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e52dfbb..fb2a7e9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -938,27 +938,85 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
* overflow into the next struct page (as it might with DEBUG_SPINLOCK).
* When freeing, reset page->mapping so free_pages_check won't complain.
*/
+#ifndef CONFIG_PREEMPT_RT
+
#define __pte_lockptr(page) &((page)->ptl)
-#define pte_lock_init(_page) do { \
- spin_lock_init(__pte_lockptr(_page)); \
-} while (0)
+
+static inline struct page *pte_lock_init(struct page *page)
+{
+ spin_lock_init(__pte_lockptr(page));
+ return page;
+}
+
#define pte_lock_deinit(page) ((page)->mapping = NULL)
+
+#else /* PREEMPT_RT */
+
+/*
+ * On PREEMPT_RT the spinlock_t's are too large to embed in the
+ * page frame, hence it only has a pointer and we need to dynamically
+ * allocate the lock when we allocate PTE-pages.
+ *
+ * This is an overall win, since only a small fraction of the pages
+ * will be PTE pages under normal circumstances.
+ */
+
+#define __pte_lockptr(page) ((page)->ptl)
+
+/*
+ * Heinous hack, relies on the caller doing something like:
+ *
+ * pte = alloc_pages(PGALLOC_GFP, 0);
+ * if (pte)
+ * pgtable_page_ctor(pte);
+ * return pte;
+ *
+ * This ensures we release the page and return NULL when the
+ * lock allocation fails.
+ */
+static inline struct page *pte_lock_init(struct page *page)
+{
+ page->ptl = kmalloc(sizeof(spinlock_t), GFP_KERNEL);
+ if (page->ptl) {
+ spin_lock_init(__pte_lockptr(page));
+ } else {
+ __free_page(page);
+ page = NULL;
+ }
+ return page;
+}
+
+static inline void pte_lock_deinit(struct page *page)
+{
+ kfree(page->ptl);
+ page->mapping = NULL;
+}
+
+#endif /* PREEMPT_RT */
+
#define pte_lockptr(mm, pmd) ({(void)(mm); __pte_lockptr(pmd_page(*(pmd)));})
#else /* !USE_SPLIT_PTLOCKS */
/*
* We use mm->page_table_lock to guard all pagetable pages of the mm.
*/
-#define pte_lock_init(page) do {} while (0)
+static inline struct page *pte_lock_init(struct page *page) { return page; }
#define pte_lock_deinit(page) do {} while (0)
#define pte_lockptr(mm, pmd) ({(void)(pmd); &(mm)->page_table_lock;})
#endif /* USE_SPLIT_PTLOCKS */
-static inline void pgtable_page_ctor(struct page *page)
+static inline struct page *__pgtable_page_ctor(struct page *page)
{
- pte_lock_init(page);
- inc_zone_page_state(page, NR_PAGETABLE);
+ page = pte_lock_init(page);
+ if (page)
+ inc_zone_page_state(page, NR_PAGETABLE);
+ return page;
}
+#define pgtable_page_ctor(page) \
+do { \
+ page = __pgtable_page_ctor(page); \
+} while (0)
+
static inline void pgtable_page_dtor(struct page *page)
{
pte_lock_deinit(page);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bd79936..2b208da 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -69,7 +69,11 @@ struct page {
*/
};
#if USE_SPLIT_PTLOCKS
+#ifndef CONFIG_PREEMPT_RT
spinlock_t ptl;
+#else
+ spinlock_t *ptl;
+#endif
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
On Tue, 10 Nov 2009, Peter Zijlstra wrote:
> On Tue, 2009-11-10 at 22:02 +0000, Hugh Dickins wrote:
> >
> > Take the easy way out: switch off SPLIT_PTLOCK_CPUS when DEBUG_SPINLOCK
> > or DEBUG_LOCK_ALLOC is in force. I've sometimes tried to be cleverer,
> > kmallocing a cacheline for the spinlock when it doesn't fit, but given
> > up each time. Falling back to mm->page_table_lock (as we do when ptlock
> > is not split) lets lockdep check out the strictest path anyway.
>
> Why? we know lockdep bloats stuff we never cared.. and hiding a popular
> CONFIG option from lockdep doesn't seem like a good idea to me.
That's a fair opinion, and indeed I Cc'ed you in case it were yours.
I'd like to see how other people feel about it. Personally I detest
and regret that bloat to struct page, when there's only one particular
use of a page that remotely excuses it.
If it were less tiresome, I'd have gone for the dynamic kmalloc; but
it seemed silly to make that effort when the Kconfig mod is so easy.
But so far as letting lockdep do its job goes, we're actually better
off using page_table_lock there, as I tried to explain: since that
lock is used for a few other purposes, lockdep is more likely to
catch an issue which the SPLIT_PTLOCK case could be hiding.
Hugh
On Tue, 10 Nov 2009, Peter Zijlstra wrote:
>
> fwiw, in -rt we carry this, because there spinlock_t is huge even
> without lockdep.
Thanks, I may want to consider that; but I'm not keen on darting
off to another cacheline even for the non-debug spinlock case.
Hugh
> Remove three degrees of obfuscation, left over from when we had
> CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT
> is CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o)
> are only built when CONFIG_MMU, so don't need such conditions at all.
>
> Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK*
> lines from 169 defconfigs: leave those to evolve in due course.
>
> Signed-off-by: Hugh Dickins <[email protected]>
I don't recall why Lee added this config option, but it seems very
reasonable and I strongly like it.
At least, vmscan folks never said "please try to disable CONFIG_MLOCK".
It means this option didn't help our debugging.
Reviewed-by: KOSAKI Motohiro <[email protected]>
On Tue, 10 Nov 2009 22:06:49 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:
> When do_nonlinear_fault() realizes that the page table must have been
> corrupted for it to have been called, it does print_bad_pte() and
> returns ... VM_FAULT_OOM, which is hard to understand.
>
> It made some sense when I did it for 2.6.15, when do_page_fault()
> just killed the current process; but nowadays it lets the OOM killer
> decide who to kill - so page table corruption in one process would
> be liable to kill another.
>
> Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee
> that the process will be killed, but is good enough for such a rare
> abnormality, accompanied as it is by the "BUG: Bad page map" message.
>
> And recent HWPOISON work has copied that code into do_swap_page(),
> when it finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
>
> Signed-off-by: Hugh Dickins <[email protected]>
Thank you !
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> On Tue, 10 Nov 2009 22:06:49 +0000 (GMT)
> Hugh Dickins <[email protected]> wrote:
>
> > When do_nonlinear_fault() realizes that the page table must have been
> > corrupted for it to have been called, it does print_bad_pte() and
> > returns ... VM_FAULT_OOM, which is hard to understand.
> >
> > It made some sense when I did it for 2.6.15, when do_page_fault()
> > just killed the current process; but nowadays it lets the OOM killer
> > decide who to kill - so page table corruption in one process would
> > be liable to kill another.
> >
> > Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee
> > that the process will be killed, but is good enough for such a rare
> > abnormality, accompanied as it is by the "BUG: Bad page map" message.
> >
> > And recent HWPOISON work has copied that code into do_swap_page(),
> > when it finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> Thank you !
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Thank you, me too.
Reviewed-by: KOSAKI Motohiro <[email protected]>
On Wed, Nov 11, 2009 at 10:42:04AM +0800, KOSAKI Motohiro wrote:
> > On Tue, 10 Nov 2009 22:06:49 +0000 (GMT)
> > Hugh Dickins <[email protected]> wrote:
> >
> > > When do_nonlinear_fault() realizes that the page table must have been
> > > corrupted for it to have been called, it does print_bad_pte() and
> > > returns ... VM_FAULT_OOM, which is hard to understand.
> > >
> > > It made some sense when I did it for 2.6.15, when do_page_fault()
> > > just killed the current process; but nowadays it lets the OOM killer
> > > decide who to kill - so page table corruption in one process would
> > > be liable to kill another.
> > >
> > > Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee
> > > that the process will be killed, but is good enough for such a rare
> > > abnormality, accompanied as it is by the "BUG: Bad page map" message.
> > >
> > > And recent HWPOISON work has copied that code into do_swap_page(),
> > > when it finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
> > >
> > > Signed-off-by: Hugh Dickins <[email protected]>
> >
> > Thank you !
> > Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Thank you, me too.
>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
Thank you!
Reviewed-by: Wu Fengguang <[email protected]>
Some unrelated comments:
We observed that copy_to_user() on a hwpoison page would trigger 3
(duplicate) late kills (the last three lines below):
early kill:
[ 56.964041] virtual address 7fffcab7d000 found in vma
[ 56.964390] 7fffcab7d000 phys b4365000
[ 58.089254] Triggering MCE exception on CPU 0
[ 58.089563] Disabling lock debugging due to kernel taint
[ 58.089914] Machine check events logged
[ 58.090187] MCE exception done on CPU 0
[ 58.090462] MCE 0xb4365: page flags 0x100000000100068=uptodate,lru,active,mmap,anonymous,swapbacked count 1 mapcount 1
[ 58.091878] MCE 0xb4365: Killing copy_to_user_te:3768 early due to hardware memory corruption
[ 58.092425] MCE 0xb4365: dirty LRU page recovery: Recovered
late kill on copy_to_user():
[ 59.136331] Copy 4096 bytes to 00007fffcab7d000
[ 59.136641] MCE: Killing copy_to_user_te:3768 due to hardware memory corruption fault at 7fffcab7d000
[ 59.137231] MCE: Killing copy_to_user_te:3768 due to hardware memory corruption fault at 7fffcab7d000
[ 59.137812] MCE: Killing copy_to_user_te:3768 due to hardware memory corruption fault at 7fffcab7d001
And this patch does not affect it (somehow weird but harmless behavior).
Thanks,
Fengguang
On Wed, Nov 11, 2009 at 7:06 AM, Hugh Dickins
<[email protected]> wrote:
> When do_nonlinear_fault() realizes that the page table must have been
> corrupted for it to have been called, it does print_bad_pte() and
> returns ... VM_FAULT_OOM, which is hard to understand.
>
> It made some sense when I did it for 2.6.15, when do_page_fault()
> just killed the current process; but nowadays it lets the OOM killer
> decide who to kill - so page table corruption in one process would
> be liable to kill another.
>
> Change it to return VM_FAULT_SIGBUS instead: that doesn't guarantee
> that the process will be killed, but is good enough for such a rare
> abnormality, accompanied as it is by the "BUG: Bad page map" message.
>
> And recent HWPOISON work has copied that code into do_swap_page(),
> when it finds an impossible swap entry: fix that to VM_FAULT_SIGBUS too.
>
> Signed-off-by: Hugh Dickins <[email protected]>
I already agreed with this. :)
Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim
Hi Hugh
> @@ -1081,45 +1053,23 @@ static int try_to_unmap_file(struct page
> unsigned long max_nl_cursor = 0;
> unsigned long max_nl_size = 0;
> unsigned int mapcount;
> - unsigned int mlocked = 0;
> - int unlock = TTU_ACTION(flags) == TTU_MUNLOCK;
> -
> - if (MLOCK_PAGES && unlikely(unlock))
> - ret = SWAP_SUCCESS; /* default for try_to_munlock() */
>
> spin_lock(&mapping->i_mmap_lock);
> vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> - if (MLOCK_PAGES && unlikely(unlock)) {
> - if (!((vma->vm_flags & VM_LOCKED) &&
> - page_mapped_in_vma(page, vma)))
> - continue; /* must visit all vmas */
> - ret = SWAP_MLOCK;
> - } else {
> - ret = try_to_unmap_one(page, vma, flags);
> - if (ret == SWAP_FAIL || !page_mapped(page))
> - goto out;
> - }
> - if (ret == SWAP_MLOCK) {
> - mlocked = try_to_mlock_page(page, vma);
> - if (mlocked)
> - break; /* stop if actually mlocked page */
> - }
> + ret = try_to_unmap_one(page, vma, flags);
> + if (ret != SWAP_AGAIN || !page_mapped(page))
> + goto out;
> }
>
> - if (mlocked)
> + if (list_empty(&mapping->i_mmap_nonlinear))
> goto out;
>
> - if (list_empty(&mapping->i_mmap_nonlinear))
> + /* We don't bother to try to find the munlocked page in nonlinears */
> + if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
> goto out;
I have a dumb question.
Does this shortcut-exit code make any behavior change?
>
> list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
> shared.vm_set.list) {
> - if (MLOCK_PAGES && unlikely(unlock)) {
> - if (!(vma->vm_flags & VM_LOCKED))
> - continue; /* must visit all vmas */
> - ret = SWAP_MLOCK; /* leave mlocked == 0 */
> - goto out; /* no need to look further */
> - }
> if (!MLOCK_PAGES && !(flags & TTU_IGNORE_MLOCK) &&
> (vma->vm_flags & VM_LOCKED))
> continue;
> @@ -1161,10 +1111,9 @@ static int try_to_unmap_file(struct page
> cursor = (unsigned long) vma->vm_private_data;
> while ( cursor < max_nl_cursor &&
> cursor < vma->vm_end - vma->vm_start) {
> - ret = try_to_unmap_cluster(cursor, &mapcount,
> - vma, page);
> - if (ret == SWAP_MLOCK)
> - mlocked = 2; /* to return below */
> + if (try_to_unmap_cluster(cursor, &mapcount,
> + vma, page) == SWAP_MLOCK)
> + ret = SWAP_MLOCK;
> cursor += CLUSTER_SIZE;
> vma->vm_private_data = (void *) cursor;
> if ((int)mapcount <= 0)
> @@ -1185,10 +1134,6 @@ static int try_to_unmap_file(struct page
> vma->vm_private_data = NULL;
> out:
> spin_unlock(&mapping->i_mmap_lock);
> - if (mlocked)
> - ret = SWAP_MLOCK; /* actually mlocked the page */
> - else if (ret == SWAP_MLOCK)
> - ret = SWAP_AGAIN; /* saw VM_LOCKED vma */
> return ret;
> }
>
On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:
> > Remove three degrees of obfuscation, left over from when we had
> > CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT
> > is CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o)
> > are only built when CONFIG_MMU, so don't need such conditions at all.
>
> I don't recall why Lee added this config option, but it seems very
> reasonable and I strongly like it.
>
> At least, vmscan folks never said "please try to disable CONFIG_MLOCK".
> It means this option didn't help our debugging.
>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
Thanks. CONFIG_HAVE_MLOCKED_PAGE_BIT and CONFIG_HAVE_MLOCK were both
just internal, automatically defaulted options which the user never
saw (except in .config). I think they were there to sort out the
interdependencies between CONFIG_MMU and CONFIG_UNEVICTABLE_LRU,
and probably other historical issues while people decided whether
or not to go ahead with having a page bit for the thing. So no
user should notice their disappearance: removing them just makes
the code clearer, that's all.
Hugh
On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:
Though it doesn't quite answer your question,
I'll just reinsert the last paragraph of my description here...
> > try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
> > amusing: once unravelled, it turns out to have been choosing between
> > two different ways of doing the same nothing. Ah, no, one way was
> > actually returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
...
> > @@ -1081,45 +1053,23 @@ static int try_to_unmap_file(struct page
...
> >
> > - if (list_empty(&mapping->i_mmap_nonlinear))
> > + /* We don't bother to try to find the munlocked page in nonlinears */
> > + if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
> > goto out;
>
> I have a dumb question.
> Does this shortcut-exit code make any behavior change?
Not dumb. My intention was to make no behaviour change with any of
this patch; but in checking back before completing the description,
I suddenly realized that that shortcut intentionally avoids the
if (max_nl_size == 0) { /* all nonlinears locked or reserved ? */
ret = SWAP_FAIL;
goto out;
}
(which doesn't show up in the patch: you'll have to look at rmap.c),
which used to have the effect of try_to_munlock() returning SWAP_FAIL
in the case when there were one or more VM_NONLINEAR vmas of the file,
but none of them (and none of the covering linear vmas) VM_LOCKED.
That should have been a SWAP_SUCCESS case, or with my changes
another SWAP_AGAIN, either of which would make munlock_vma_page()
count_vm_event(UNEVICTABLE_PGMUNLOCKED);
which would be correct; but the SWAP_FAIL meant that count was not
incremented in this case.
Actually, I've double-fixed that, because I also changed
munlock_vma_page() to increment the count whenever ret != SWAP_MLOCK;
which seemed more appropriate, but would have been a no-op if
try_to_munlock() only returned SWAP_SUCCESS or SWAP_AGAIN or SWAP_MLOCK
as it claimed.
But I wasn't very inclined to boast of fixing that bug, since my testing
didn't give confidence that those /proc/vmstat unevictable_pgs_*lock*
counts are being properly maintained anyway - when I locked the same
pages in two vmas then unlocked them in both, I ended up with mlocked
bigger than munlocked (with or without my 2/6 patch); which I suspect
is wrong, but rather off my present course towards KSM swapping...
Hugh
On Tue, Nov 10, 2009 at 09:59:23PM +0000, Hugh Dickins wrote:
> Remove three degrees of obfuscation, left over from when we had
> CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT
> is CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o)
> are only built when CONFIG_MMU, so don't need such conditions at all.
Thanks. The memory-failure.c change looks good and indeed
it's overall less confusing.
-Andi
> @@ -787,6 +787,8 @@ static int try_to_unmap_one(struct page
> ret = SWAP_MLOCK;
> goto out_unmap;
> }
> + if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
> + goto out_unmap;
> }
> if (!(flags & TTU_IGNORE_ACCESS)) {
> if (ptep_clear_flush_young_notify(vma, address, pte)) {
> @@ -852,12 +854,22 @@ static int try_to_unmap_one(struct page
> } else
> dec_mm_counter(mm, file_rss);
>
> -
> page_remove_rmap(page);
> page_cache_release(page);
>
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> +
> + if (MLOCK_PAGES && ret == SWAP_MLOCK) {
> + ret = SWAP_AGAIN;
> + if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
> + if (vma->vm_flags & VM_LOCKED) {
> + mlock_vma_page(page);
> + ret = SWAP_MLOCK;
> + }
> + up_read(&vma->vm_mm->mmap_sem);
> + }
> + }
> out:
Very small nit. How about this?
------------------------------------------------------------
>From 9d4b507572eccf88dcaa02e650df59874216528c Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Fri, 13 Nov 2009 15:00:04 +0900
Subject: [PATCH] Simplify try_to_unmap_one()
SWAP_MLOCK means "We marked the page as PG_MLOCK, please move it to
unevictable-lru". So the following code is confusing.
if (vma->vm_flags & VM_LOCKED) {
ret = SWAP_MLOCK;
goto out_unmap;
}
Plus, if the VMA doesn't have VM_LOCKED, we don't need to check
whether mlock_vma_page() needs to be called.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/rmap.c | 25 ++++++++++++-------------
1 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 4440a86..81a168c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -784,10 +784,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* skipped over this mm) then we should reactivate it.
*/
if (!(flags & TTU_IGNORE_MLOCK)) {
- if (vma->vm_flags & VM_LOCKED) {
- ret = SWAP_MLOCK;
- goto out_unmap;
- }
+ if (vma->vm_flags & VM_LOCKED)
+ goto out_unlock;
if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
goto out_unmap;
}
@@ -856,18 +854,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
out_unmap:
pte_unmap_unlock(pte, ptl);
+out:
+ return ret;
- if (MLOCK_PAGES && ret == SWAP_MLOCK) {
- ret = SWAP_AGAIN;
- if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
- if (vma->vm_flags & VM_LOCKED) {
- mlock_vma_page(page);
- ret = SWAP_MLOCK;
- }
- up_read(&vma->vm_mm->mmap_sem);
+out_unlock:
+ pte_unmap_unlock(pte, ptl);
+
+ if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ if (vma->vm_flags & VM_LOCKED) {
+ mlock_vma_page(page);
+ ret = SWAP_MLOCK;
}
+ up_read(&vma->vm_mm->mmap_sem);
}
-out:
return ret;
}
--
1.6.2.5
> On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:
>
> Though it doesn't quite answer your question,
> I'll just reinsert the last paragraph of my description here...
>
> > > try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
> > > amusing: once unravelled, it turns out to have been choosing between
> > > two different ways of doing the same nothing. Ah, no, one way was
> > > actually returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
>
> ...
> > > @@ -1081,45 +1053,23 @@ static int try_to_unmap_file(struct page
> ...
> > >
> > > - if (list_empty(&mapping->i_mmap_nonlinear))
> > > + /* We don't bother to try to find the munlocked page in nonlinears */
> > > + if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
> > > goto out;
> >
> > I have a dumb question.
> > Does this shortcut-exit code make any behavior change?
>
> Not dumb. My intention was to make no behaviour change with any of
> this patch; but in checking back before completing the description,
> I suddenly realized that that shortcut intentionally avoids the
>
> if (max_nl_size == 0) { /* all nonlinears locked or reserved ? */
> ret = SWAP_FAIL;
> goto out;
> }
>
> (which doesn't show up in the patch: you'll have to look at rmap.c),
> which used to have the effect of try_to_munlock() returning SWAP_FAIL
> in the case when there were one or more VM_NONLINEAR vmas of the file,
> but none of them (and none of the covering linear vmas) VM_LOCKED.
>
> That should have been a SWAP_SUCCESS case, or with my changes
> another SWAP_AGAIN, either of which would make munlock_vma_page()
> count_vm_event(UNEVICTABLE_PGMUNLOCKED);
> which would be correct; but the SWAP_FAIL meant that count was not
> incremented in this case.
Ah, correct.
Then we lost the ability to keep non-linear-mapped pages unevictable, right?
If so, does the following additional patch make things more consistent?
>
> Actually, I've double-fixed that, because I also changed
> munlock_vma_page() to increment the count whenever ret != SWAP_MLOCK;
> which seemed more appropriate, but would have been a no-op if
> try_to_munlock() only returned SWAP_SUCCESS or SWAP_AGAIN or SWAP_MLOCK
> as it claimed.
>
> But I wasn't very inclined to boast of fixing that bug, since my testing
> didn't give confidence that those /proc/vmstat unevictable_pgs_*lock*
> counts are being properly maintained anyway - when I locked the same
> pages in two vmas then unlocked them in both, I ended up with mlocked
> bigger than munlocked (with or without my 2/6 patch); which I suspect
> is wrong, but rather off my present course towards KSM swapping...
Ah, the vmstat inconsistency is weird. I'll try to debug it later.
Thanks for this notice.
----------------------------------
>From 3fd3bc58dc6505af73ecf92c981609ecf8b6ac40 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Fri, 13 Nov 2009 16:52:03 +0900
Subject: [PATCH] [RFC] mm: non linear mapping page don't mark as PG_mlocked
Now try_to_unmap_file() has lost the ability to handle VM_NONLINEAR.
So mlock() shouldn't mark the pages of a non-linear mapping as
PG_mlocked. Otherwise such a page keeps doing a drunkard's walk between
the evictable and unevictable LRUs.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/mlock.c | 37 +++++++++++++++++++++++--------------
1 files changed, 23 insertions(+), 14 deletions(-)
diff --git a/mm/mlock.c b/mm/mlock.c
index 48691fb..4187f9c 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -266,25 +266,34 @@ long mlock_vma_pages_range(struct vm_area_struct *vma,
if (vma->vm_flags & (VM_IO | VM_PFNMAP))
goto no_mlock;
- if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
+ if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
is_vm_hugetlb_page(vma) ||
- vma == get_gate_vma(current))) {
+ vma == get_gate_vma(current)) {
+
+ /*
+ * User mapped kernel pages or huge pages:
+ * make these pages present to populate the ptes, but
+ * fall thru' to reset VM_LOCKED--no need to unlock, and
+ * return nr_pages so these don't get counted against task's
+ * locked limit. huge pages are already counted against
+ * locked vm limit.
+ */
+ make_pages_present(start, end);
+ goto no_mlock;
+ }
+ if (vma->vm_flags & VM_NONLINEAR)
+ /*
+ * try_to_munlock() doesn't handle VM_NONLINEAR; let's be
+ * consistent.
+ */
+ make_pages_present(start, end);
+ else
__mlock_vma_pages_range(vma, start, end);
- /* Hide errors from mmap() and other callers */
- return 0;
- }
+ /* Hide errors from mmap() and other callers */
+ return 0;
- /*
- * User mapped kernel pages or huge pages:
- * make these pages present to populate the ptes, but
- * fall thru' to reset VM_LOCKED--no need to unlock, and
- * return nr_pages so these don't get counted against task's
- * locked limit. huge pages are already counted against
- * locked vm limit.
- */
- make_pages_present(start, end);
no_mlock:
vma->vm_flags &= ~VM_LOCKED; /* and don't come back! */
--
1.6.2.5
> > On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:
> >
> > Though it doesn't quite answer your question,
> > I'll just reinsert the last paragraph of my description here...
> >
> > > > try_to_unmap_file()'s TTU_MUNLOCK nonlinear handling was particularly
> > > > amusing: once unravelled, it turns out to have been choosing between
> > > > two different ways of doing the same nothing. Ah, no, one way was
> > > > actually returning SWAP_FAIL when it meant to return SWAP_SUCCESS.
> >
> > ...
> > > > @@ -1081,45 +1053,23 @@ static int try_to_unmap_file(struct page
> > ...
> > > >
> > > > - if (list_empty(&mapping->i_mmap_nonlinear))
> > > > + /* We don't bother to try to find the munlocked page in nonlinears */
> > > > + if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
> > > > goto out;
> > >
> > > I have dumb question.
> > > Does this shortcut exiting code makes any behavior change?
> >
> > Not dumb. My intention was to make no behaviour change with any of
> > this patch; but in checking back before completing the description,
> > I suddenly realized that that shortcut intentionally avoids the
> >
> > if (max_nl_size == 0) { /* all nonlinears locked or reserved ? */
> > ret = SWAP_FAIL;
> > goto out;
> > }
> >
> > (which doesn't show up in the patch: you'll have to look at rmap.c),
> > which used to have the effect of try_to_munlock() returning SWAP_FAIL
> > in the case when there were one or more VM_NONLINEAR vmas of the file,
> > but none of them (and none of the covering linear vmas) VM_LOCKED.
> >
> > That should have been a SWAP_SUCCESS case, or with my changes
> > another SWAP_AGAIN, either of which would make munlock_vma_page()
> > count_vm_event(UNEVICTABLE_PGMUNLOCKED);
> > which would be correct; but the SWAP_FAIL meant that count was not
> > incremented in this case.
>
> Ah, correct.
> Then we lost the capability to make non-linear mapping pages unevictable, right?
> If so, does the following additional patch make things more consistent?
[indistinct muttering]
Probably we can remove VM_NONLINEAR completely. I've never seen a real user of it.
On Fri, Nov 13, 2009 at 05:26:14PM +0900, KOSAKI Motohiro wrote:
> Probably we can remove VM_NONLINEAR completely. I've never seen a real user of it.
Do you mean as a whole or only in the mlock logic? Databases are using
remap_file_pages on 32bit archs to avoid generating zillions of vmas on
scattered tmpfs mappings. On 64bit it could only be useful to some
emulators, but with real shadow paging and nonlinear rmap already
created on the shadow pagetables, it looks pretty useless on 64bit archs
to me.
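(For anyone who hasn't met the interface, here is roughly how such a
database scatters file pages inside one vma.  This is only an
illustrative userspace sketch: the file name "datafile" and the page
offsets are invented, and it assumes the file is at least four pages
long.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int fd = open("datafile", O_RDONLY);
	char *map;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* One ordinary linear vma covering four pages of the file... */
	map = mmap(NULL, 4 * page, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * ...then rearrange it in place: make the first page of the
	 * mapping show file page 3 instead of file page 0.  On kernels
	 * of this era the vma becomes VM_NONLINEAR, and no extra vmas
	 * are created however much the mapping is scrambled.
	 */
	if (remap_file_pages(map, page, 0, 3, 0) < 0)
		perror("remap_file_pages");

	munmap(map, 4 * page);
	close(fd);
	return 0;
}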
> On Fri, Nov 13, 2009 at 05:26:14PM +0900, KOSAKI Motohiro wrote:
> > Probably we can remove VM_NONLINEAR completely. I've never seen a real user of it.
>
> Do you mean as a whole or only in the mlock logic? Databases are using
> remap_file_pages on 32bit archs to avoid generating zillions of vmas on
> scattered tmpfs mappings. On 64bit it could only be useful to some
> emulators, but with real shadow paging and nonlinear rmap already
> created on the shadow pagetables, it looks pretty useless on 64bit archs
> to me.
Hehe, you point out that kosaki is stupid and knowledgeless.
Thanks for correcting me.
Probably we have to maintain VM_NONLINEAR for another year or two. Two years
from now, nobody will be using a database on a 32bit machine.
(A low-end DB might still be used on 32bit, but that's ok: it doesn't use VM_NONLINEAR.)
On Fri, 13 Nov 2009, KOSAKI Motohiro wrote:
>
> Very small nit. How about this?
Yes, that takes it a stage further, I prefer that, thanks: but better
redo against mmotm, I removed the "MLOCK_PAGES && " in a later patch.
Hugh
>
>
> ------------------------------------------------------------
> From 9d4b507572eccf88dcaa02e650df59874216528c Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Fri, 13 Nov 2009 15:00:04 +0900
> Subject: [PATCH] Simplify try_to_unmap_one()
>
> SWAP_MLOCK means "we marked the page as PG_mlocked, please move it to the
> unevictable lru". So the following code is confusing.
>
> if (vma->vm_flags & VM_LOCKED) {
> ret = SWAP_MLOCK;
> goto out_unmap;
> }
>
> Plus, if the VMA doesn't have VM_LOCKED, we don't need to check
> whether mlock_vma_page() needs to be called.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/rmap.c | 25 ++++++++++++-------------
> 1 files changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4440a86..81a168c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -784,10 +784,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> * skipped over this mm) then we should reactivate it.
> */
> if (!(flags & TTU_IGNORE_MLOCK)) {
> - if (vma->vm_flags & VM_LOCKED) {
> - ret = SWAP_MLOCK;
> - goto out_unmap;
> - }
> + if (vma->vm_flags & VM_LOCKED)
> + goto out_unlock;
> if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
> goto out_unmap;
> }
> @@ -856,18 +854,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> +out:
> + return ret;
>
> - if (MLOCK_PAGES && ret == SWAP_MLOCK) {
> - ret = SWAP_AGAIN;
> - if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
> - if (vma->vm_flags & VM_LOCKED) {
> - mlock_vma_page(page);
> - ret = SWAP_MLOCK;
> - }
> - up_read(&vma->vm_mm->mmap_sem);
> +out_unlock:
> + pte_unmap_unlock(pte, ptl);
> +
> + if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
> + if (vma->vm_flags & VM_LOCKED) {
> + mlock_vma_page(page);
> + ret = SWAP_MLOCK;
> }
> + up_read(&vma->vm_mm->mmap_sem);
> }
> -out:
> return ret;
> }
>
> --
> 1.6.2.5
On Fri, 13 Nov 2009, KOSAKI Motohiro wrote:
> If so, does the following additional patch make things more consistent?
> ----------------------------------
> From 3fd3bc58dc6505af73ecf92c981609ecf8b6ac40 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Fri, 13 Nov 2009 16:52:03 +0900
> Subject: [PATCH] [RFC] mm: non linear mapping page don't mark as PG_mlocked
>
> Now, try_to_unmap_file() lost the capability to treat VM_NONLINEAR.
Now?
Genuine try_to_unmap_file() deals with VM_NONLINEAR (including VM_LOCKED)
much as it always did, I think. But try_to_munlock() on a VM_NONLINEAR
has not been doing anything useful, I assume ever since it was added,
but haven't checked the history.
But so what? try_to_munlock() has those down_read_trylock()s which make
it never quite reliable. In the VM_NONLINEAR case it has simply been
giving up rather more easily.
> Then, mlock() shouldn't mark the pages of a non-linear mapping as
> PG_mlocked. Otherwise such a page continues to drinker walk between the
> evictable and unevictable lru.
I do like your phrase "drinker walk". But is it really worse than
the lazy discovery of the page being locked, which is how I thought
this stuff was originally supposed to work anyway. I presume cases
were found in which the counts got so far out that it was a problem?
I liked the lazy discovery much better than trying to keep count;
can we just accept that VM_NONLINEAR may leave the counts further
away from exactitude?
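For concreteness, the lazy discovery I mean is the path which page
reclaim already takes: roughly this, from shrink_page_list() (a sketch
from memory, not an exact quote of vmscan.c):

	switch (try_to_unmap(page, TTU_UNMAP)) {
	case SWAP_FAIL:
		goto activate_locked;
	case SWAP_AGAIN:
		goto keep_locked;
	case SWAP_MLOCK:
		/* VM_LOCKED vma found: cull to the unevictable lru */
		goto cull_mlocked;
	case SWAP_SUCCESS:
		; /* try to free the page below */
	}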
I don't think this patch makes things more consistent, really.
It does make sys_remap_file_pages on an mlocked area inconsistent
with mlock on a sys_remap_file_pages area, doesn't it?
Hugh
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/mlock.c | 37 +++++++++++++++++++++++--------------
> 1 files changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 48691fb..4187f9c 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -266,25 +266,34 @@ long mlock_vma_pages_range(struct vm_area_struct *vma,
> if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> goto no_mlock;
>
> - if (!((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> + if ((vma->vm_flags & (VM_DONTEXPAND | VM_RESERVED)) ||
> is_vm_hugetlb_page(vma) ||
> - vma == get_gate_vma(current))) {
> + vma == get_gate_vma(current)) {
> +
> + /*
> + * User mapped kernel pages or huge pages:
> + * make these pages present to populate the ptes, but
> + * fall thru' to reset VM_LOCKED--no need to unlock, and
> + * return nr_pages so these don't get counted against task's
> + * locked limit. huge pages are already counted against
> + * locked vm limit.
> + */
> + make_pages_present(start, end);
> + goto no_mlock;
> + }
>
> + if (vma->vm_flags & VM_NONLINEAR)
> + /*
> + * try_to_munmap() doesn't treat VM_NONLINEAR. let's make
> + * consist.
> + */
> + make_pages_present(start, end);
> + else
> __mlock_vma_pages_range(vma, start, end);
>
> - /* Hide errors from mmap() and other callers */
> - return 0;
> - }
> + /* Hide errors from mmap() and other callers */
> + return 0;
>
> - /*
> - * User mapped kernel pages or huge pages:
> - * make these pages present to populate the ptes, but
> - * fall thru' to reset VM_LOCKED--no need to unlock, and
> - * return nr_pages so these don't get counted against task's
> - * locked limit. huge pages are already counted against
> - * locked vm limit.
> - */
> - make_pages_present(start, end);
>
> no_mlock:
> vma->vm_flags &= ~VM_LOCKED; /* and don't come back! */
> --
> 1.6.2.5
> On Fri, 13 Nov 2009, KOSAKI Motohiro wrote:
> >
> > Very small nit. How about this?
>
> Yes, that takes it a stage further, I prefer that, thanks: but better
> redo against mmotm, I removed the "MLOCK_PAGES && " in a later patch.
Ah, yes.
Thanks.
I'll redo it.
> On Fri, 13 Nov 2009, KOSAKI Motohiro wrote:
> > If so, does the following additional patch make things more consistent?
> > ----------------------------------
> > From 3fd3bc58dc6505af73ecf92c981609ecf8b6ac40 Mon Sep 17 00:00:00 2001
> > From: KOSAKI Motohiro <[email protected]>
> > Date: Fri, 13 Nov 2009 16:52:03 +0900
> > Subject: [PATCH] [RFC] mm: non linear mapping page don't mark as PG_mlocked
> >
> > Now, try_to_unmap_file() lost the capability to treat VM_NONLINEAR.
>
> Now?
> Genuine try_to_unmap_file() deals with VM_NONLINEAR (including VM_LOCKED)
> much as it always did, I think. But try_to_munlock() on a VM_NONLINEAR
> has not been doing anything useful, I assume ever since it was added,
> but haven't checked the history.
>
> But so what? try_to_munlock() has those down_read_trylock()s which make
> it never quite reliable. In the VM_NONLINEAR case it has simply been
> giving up rather more easily.
I caught your point, maybe. Thanks for correcting me. I agree with your lazy
discovery method.
So, can we add a kinder comment? (see below)
> > Then, mlock() shouldn't mark the pages of a non-linear mapping as
> > PG_mlocked. Otherwise such a page continues to drinker walk between the
> > evictable and unevictable lru.
>
> I do like your phrase "drinker walk". But is it really worse than
> the lazy discovery of the page being locked, which is how I thought
> this stuff was originally supposed to work anyway. I presume cases
> were found in which the counts got so far out that it was a problem?
>
> I liked the lazy discovery much better than trying to keep count;
> can we just accept that VM_NONLINEAR may leave the counts further
> away from exactitude?
>
> I don't think this patch makes things more consistent, really.
> It does make sys_remap_file_pages on an mlocked area inconsistent
> with mlock on a sys_remap_file_pages area, doesn't it?
You are right.
From 7332f765dbaa1fbfe48cf8d53b20048f7f8105e0 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Tue, 17 Nov 2009 10:46:51 +0900
Subject: comment adding to mlocking in try_to_unmap_one
The current code doesn't kindly tell us why we don't bother with nonlinears.
This patch adds a small explanation.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/rmap.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 81a168c..c631407 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1061,7 +1061,11 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
if (list_empty(&mapping->i_mmap_nonlinear))
goto out;
- /* We don't bother to try to find the munlocked page in nonlinears */
+ /*
+ * We don't bother to try to find the munlocked page in nonlinears.
+ * It's costly. Instead, later, page reclaim logic may call
+ * try_to_unmap(TTU_MUNLOCK) and recover PG_mlocked lazily.
+ */
if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
goto out;
--
1.6.2.5
On Tue, 17 Nov 2009, KOSAKI Motohiro wrote:
>
> From 7332f765dbaa1fbfe48cf8d53b20048f7f8105e0 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Tue, 17 Nov 2009 10:46:51 +0900
> Subject: comment adding to mlocking in try_to_unmap_one
>
> The current code doesn't kindly tell us why we don't bother with nonlinears.
> This patch adds a small explanation.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
(if the "MLOCK_PAGES && " is removed from this one too)
> ---
> mm/rmap.c | 6 +++++-
> 1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 81a168c..c631407 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1061,7 +1061,11 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
> if (list_empty(&mapping->i_mmap_nonlinear))
> goto out;
>
> - /* We don't bother to try to find the munlocked page in nonlinears */
> + /*
> + * We don't bother to try to find the munlocked page in nonlinears.
> + * It's costly. Instead, later, page reclaim logic may call
> + * try_to_unmap(TTU_MUNLOCK) and recover PG_mlocked lazily.
> + */
> if (MLOCK_PAGES && TTU_ACTION(flags) == TTU_MUNLOCK)
> goto out;
>
> --
> 1.6.2.5
On 11/10/2009 04:51 PM, Hugh Dickins wrote:
> At present we define PageAnon(page) by the low PAGE_MAPPING_ANON bit
> set in page->mapping, with the higher bits a pointer to the anon_vma;
> and have defined PageKsm(page) as that with NULL anon_vma.
>
> But KSM swapping will need to store a pointer there: so in preparation
> for that, now define PAGE_MAPPING_FLAGS as the low two bits, including
> PAGE_MAPPING_KSM (always set along with PAGE_MAPPING_ANON, until some
> other use for the bit emerges).
>
> Declare page_rmapping(page) to return the pointer part of page->mapping,
> and page_anon_vma(page) to return the anon_vma pointer when that's what
> it is. Use these in a few appropriate places: notably, unuse_vma() has
> been testing page->mapping, but is better to be testing page_anon_vma()
> (cases may be added in which flag bits are set without any pointer).
>
> Signed-off-by: Hugh Dickins<[email protected]>
>
>
Reviewed-by: Rik van Riel <[email protected]>