Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757616AbXLRVYY (ORCPT ); Tue, 18 Dec 2007 16:24:24 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754543AbXLRVSC (ORCPT ); Tue, 18 Dec 2007 16:18:02 -0500 Received: from mx1.redhat.com ([66.187.233.31]:58156 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753270AbXLRVRk (ORCPT ); Tue, 18 Dec 2007 16:17:40 -0500 Message-Id: <20071218211550.186819416@redhat.com> References: <20071218211539.250334036@redhat.com> User-Agent: quilt/0.46-1 Date: Tue, 18 Dec 2007 16:15:56 -0500 From: Rik van Riel To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, lee.shermerhorn@hp.com, Lee Schermerhorn Subject: [patch 17/20] non-reclaimable mlocked pages Content-Disposition: inline; filename=noreclaim-04.1-prepare-for-mlocked-pages.patch Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 30279 Lines: 930 V2 -> V3: + rebase to 23-mm1 atop RvR's split lru series + fix page flags macros for *PageMlocked() when not configured. + ensure lru_add_drain_all() runs on all cpus when NORECLIM_MLOCK configured. Was just for NUMA. V1 -> V2: + moved this patch [and related patches] up to right after ramdisk/ramfs and SHM_LOCKed patches. + add [back] missing put_page() in putback_lru_page(). This solved page leakage as seen by stats in previous version. + fix up munlock_vma_page() to isolate page from lru before calling try_to_unlock(). Think I detected a race here. + use TestClearPageMlock() on old page in migrate.c's migrate_page_copy() to clean up old page. + live dangerously: remove TestSetPageLocked() in is_mlocked_vma()--should only be called on new pages in the fault path--iff we chose to cull there [later patch]. + Add PG_mlocked to free_pages_check() etc to detect mlock state mismanagement. NOTE: temporarily [???] commented out--tripping over it under load. Why? Rework of a patch by Nick Piggin -- part 1 of 2. This patch: 1) defines the [CONFIG_]NORECLAIM_MLOCK sub-option and the stub version of the mlock/noreclaim APIs when it's not configured. Depends on [CONFIG_]NORECLAIM. 2) add yet another page flag--PG_mlocked--to indicate that the page is locked for efficient testing in vmscan and, optionally, fault path. This allows early culling of nonreclaimable pages, preventing them from getting to page_referenced()/try_to_unmap(). Also allows separate accounting of mlock'd pages, as Nick's original patch did. Uses a bit available only to 64-bit systems. Note: Nick's original mlock patch used a PG_mlocked flag. I had removed this in favor of the PG_noreclaim flag + an mlock_count [new page struct member]. I restored the PG_mlocked flag to eliminate the new count field. 3) add the mlock/noreclaim infrastructure to mm/mlock.c, with internal APIs in mm/internal.h. This is a rework of Nick's original patch to these files, taking into account that mlocked pages are now kept on noreclaim LRU list. 4) update vmscan.c:page_reclaimable() to check PageMlocked() and, if vma passed in, the vm_flags. Note that the vma will only be passed in for new pages in the fault path; and then only if the "cull nonreclaimable pages in fault path" patch is included. 5) add try_to_unlock() to rmap.c to walk a page's rmap and ClearPageMlocked() if no other vmas have it mlocked. Reuses as much of try_to_unmap() as possible. This effectively replaces the use of one of the lru list links as an mlock count. If this mechanism let's pages in mlocked vmas leak through w/o PG_mlocked set [I don't know that it does], we should catch them later in try_to_unmap(). One hopes this will be rare, as it will be relatively expensive. mm/internal.h and mm/mlock.c changes: Originally Signed-off-by: Nick Piggin Signed-off-by: Lee Schermerhorn Signed-off-by: Rik van Riel Index: linux-2.6.24-rc4-mm1/mm/Kconfig =================================================================== --- linux-2.6.24-rc4-mm1.orig/mm/Kconfig +++ linux-2.6.24-rc4-mm1/mm/Kconfig @@ -204,3 +204,17 @@ config NORECLAIM may be non-reclaimable because: they are locked into memory, they are anonymous pages for which no swap space exists, or they are anon pages that are expensive to unmap [long anon_vma "related vma" list.] + +config NORECLAIM_MLOCK + bool "Exclude mlock'ed pages from reclaim" + depends on NORECLAIM + help + Treats mlock'ed pages as no-reclaimable. Removing these pages from + the LRU [in]active lists avoids the overhead of attempting to reclaim + them. Pages marked non-reclaimable for this reason will become + reclaimable again when the last mlock is removed. + when no swap space exists. Removing these pages from the LRU lists + avoids the overhead of attempting to reclaim them. Pages marked + non-reclaimable for this reason will become reclaimable again when/if + sufficient swap space is added to the system. + Index: linux-2.6.24-rc4-mm1/mm/internal.h =================================================================== --- linux-2.6.24-rc4-mm1.orig/mm/internal.h +++ linux-2.6.24-rc4-mm1/mm/internal.h @@ -36,6 +36,60 @@ static inline void __put_page(struct pag extern int isolate_lru_page(struct page *page); +#ifdef CONFIG_NORECLAIM_MLOCK +/* + * called only for new pages in fault path + */ +extern int is_mlocked_vma(struct vm_area_struct *, struct page *); + +/* + * must be called with vma's mmap_sem held for read, and page locked. + */ +extern void mlock_vma_page(struct page *page); + +extern int __mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, int lock); + +/* + * mlock all pages in this vma range. For mmap()/mremap()/... + */ +static inline void mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ + __mlock_vma_pages_range(vma, start, end, 1); +} + +/* + * munlock range of pages. For munmap() and exit(). + * Always called to operate on a full vma that is being unmapped. + */ +static inline void munlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) +{ +// TODO: verify my assumption. Should we just drop the start/end args? + VM_BUG_ON(start != vma->vm_start || end != vma->vm_end); + + vma->vm_flags &= ~VM_LOCKED; /* try_to_unlock() needs this */ + __mlock_vma_pages_range(vma, start, end, 0); +} + +extern void clear_page_mlock(struct page *page); + +#else /* CONFIG_NORECLAIM_MLOCK */ +static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p) +{ + return 0; +} +static inline void clear_page_mlock(struct page *page) { } +static inline void mlock_vma_page(struct page *page) { } +static inline void mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) { } +static inline void munlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end) { } + +#endif /* CONFIG_NORECLAIM_MLOCK */ + + extern void fastcall __init __free_pages_bootmem(struct page *page, unsigned int order); Index: linux-2.6.24-rc4-mm1/mm/mlock.c =================================================================== --- linux-2.6.24-rc4-mm1.orig/mm/mlock.c +++ linux-2.6.24-rc4-mm1/mm/mlock.c @@ -8,10 +8,16 @@ #include #include #include +#include +#include #include #include #include #include +#include +#include + +#include "internal.h" int can_do_mlock(void) { @@ -23,19 +29,224 @@ int can_do_mlock(void) } EXPORT_SYMBOL(can_do_mlock); +#ifdef CONFIG_NORECLAIM_MLOCK +/* + * Mlocked pages are marked with PageMlocked() flag for efficient testing + * in vmscan and, possibly, the fault path. + * + * An mlocked page [PageMlocked(page)] is non-reclaimable. As such, it will + * be placed on the LRU "noreclaim" list, rather than the [in]active lists. + * The noreclaim list is an LRU sibling list to the [in]active lists. + * PageNoreclaim is set to indicate the non-reclaimable state. + * +//TODO: no longer counting, but does this still apply to lazy setting +// of PageMlocked() ?? + * When lazy incrementing via vmscan, it is important to ensure that the + * vma's VM_LOCKED status is not concurrently being modified, otherwise we + * may have elevated mlock_count of a page that is being munlocked. So lazy + * mlocked must take the mmap_sem for read, and verify that the vma really + * is locked (see mm/rmap.c). + */ + +/* + * add isolated page to appropriate LRU list, adjusting stats as needed. + * Page may still be non-reclaimable for other reasons. +//TODO: move to vmscan.c as global along with isolate_lru_page()? + */ +static void putback_lru_page(struct page *page) +{ + VM_BUG_ON(PageLRU(page)); + + ClearPageNoreclaim(page); + ClearPageActive(page); + lru_cache_add_active_or_noreclaim(page, NULL); + put_page(page); /* ref from isolate */ +} + +/* + * Clear the page's PageMlocked(). This can be useful in a situation where + * we want to unconditionally remove a page from the pagecache. + * + * It is legal to call this function for any page, mlocked or not. + * If called for a page that is still mapped by mlocked vmas, all we do + * is revert to lazy LRU behaviour -- semantics are not broken. + */ +void clear_page_mlock(struct page *page) +{ + BUG_ON(!PageLocked(page)); + + if (likely(!PageMlocked(page))) + return; + ClearPageMlocked(page); + if (!isolate_lru_page(page)) + putback_lru_page(page); +} + +/* + * Mark page as mlocked if not already. + * If page on LRU, isolate and putback to move to noreclaim list. + */ +void mlock_vma_page(struct page *page) +{ + BUG_ON(!PageLocked(page)); + + if (!TestSetPageMlocked(page) && !isolate_lru_page(page)) + putback_lru_page(page); +} + +/* + * called from munlock()/munmap() path with page supposedly on the LRU. + * + * Note: unlike mlock_vma_page(), we can't just clear the PageMlocked + * [in try_to_unlock()] and then attempt to isolate the page. We must + * isolate the page() to keep others from messing with its noreclaim + * and mlocked state while trying to unlock. However, we pre-clear the + * mlocked state anyway as we might lose the isolation race and we might + * not get another chance to clear PageMlocked. If we successfully + * isolate the page and try_to_unlock() detects other VM_LOCKED vmas + * mapping the page, we just restore the PageMlocked state. If we lose + * the isolation race, and the page is mapped by other VM_LOCKED vmas, + * we'll detect this in try_to_unmap() and we'll call mlock_vma_page() + * above, if/when we try to reclaim the page. + */ +static void munlock_vma_page(struct page *page) +{ + BUG_ON(!PageLocked(page)); + + if (TestClearPageMlocked(page) && !isolate_lru_page(page)) { + if (try_to_unlock(page) == SWAP_MLOCK) + SetPageMlocked(page); /* still VM_LOCKED */ + putback_lru_page(page); + } +} + +/* + * Called in fault path via page_reclaimable() for a new page + * to determine if it's being mapped into a LOCKED vma. + * If so, mark page as mlocked. + */ +int is_mlocked_vma(struct vm_area_struct *vma, struct page *page) +{ + VM_BUG_ON(PageMlocked(page)); // TODO: needed? + VM_BUG_ON(PageLRU(page)); + + if (likely(!(vma->vm_flags & VM_LOCKED))) + return 0; + + SetPageMlocked(page); + return 1; +} + +/* + * mlock or munlock a range of pages in the vma depending on whether + * @lock is 1 or 0, respectively. @lock must match vm_flags VM_LOCKED + * state. +TODO: we don't really need @lock, as we can determine it from vm_flags + * + * This takes care of making the pages present too. + * + * vma->vm_mm->mmap_sem must be held for write. + */ +int __mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, int lock) +{ + struct mm_struct *mm = vma->vm_mm; + unsigned long addr = start; + struct page *pages[16]; /* 16 gives a reasonable batch */ + int write = !!(vma->vm_flags & VM_WRITE); + int nr_pages; + int ret = 0; + + BUG_ON(start & ~PAGE_MASK || end & ~PAGE_MASK); + VM_BUG_ON(lock != !!(vma->vm_flags & VM_LOCKED)); + + if (vma->vm_flags & VM_IO) + return ret; + + lru_add_drain_all(); /* push cached pages to LRU */ + + nr_pages = (end - start) / PAGE_SIZE; + + while (nr_pages > 0) { + int i; + + cond_resched(); + + /* + * get_user_pages makes pages present if we are + * setting mlock. + */ + ret = get_user_pages(current, mm, addr, + min_t(int, nr_pages, ARRAY_SIZE(pages)), + write, 0, pages, NULL); + if (ret < 0) + break; + if (ret == 0) { + /* + * We know the vma is there, so the only time + * we cannot get a single page should be an + * error (ret < 0) case. + */ + WARN_ON(1); + ret = -EFAULT; + break; + } + + lru_add_drain(); /* push cached pages to LRU */ + + for (i = 0; i < ret; i++) { + struct page *page = pages[i]; + + lock_page(page); + if (lock) + mlock_vma_page(page); + else + munlock_vma_page(page); + unlock_page(page); + put_page(page); /* ref from get_user_pages() */ + + addr += PAGE_SIZE; /* for next get_user_pages() */ + nr_pages--; + } + } + + lru_add_drain_all(); /* to update stats */ + + return ret; +} + +#else /* CONFIG_NORECLAIM_MLOCK */ + +/* + * Just make pages present if @lock true. No-op if unlocking. + */ +int __mlock_vma_pages_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, int lock) +{ + int ret = 0; + + if (!lock || vma->vm_flags & VM_IO) + return ret; + + return make_pages_present(start, end); +} +#endif /* CONFIG_NORECLAIM_MLOCK */ + static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev, unsigned long start, unsigned long end, unsigned int newflags) { - struct mm_struct * mm = vma->vm_mm; + struct mm_struct *mm = vma->vm_mm; pgoff_t pgoff; - int pages; + int nr_pages; int ret = 0; + int lock; if (newflags == vma->vm_flags) { *prev = vma; goto out; } +//TODO: linear_page_index() ? non-linear pages? pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT); *prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma)); @@ -59,24 +270,25 @@ static int mlock_fixup(struct vm_area_st } success: + lock = !!(newflags & VM_LOCKED); + + /* + * Keep track of amount of locked VM. + */ + nr_pages = (end - start) >> PAGE_SHIFT; + if (!lock) + nr_pages = -nr_pages; + mm->locked_vm += nr_pages; + /* * vm_flags is protected by the mmap_sem held in write mode. * It's okay if try_to_unmap_one unmaps a page just after we - * set VM_LOCKED, make_pages_present below will bring it back. + * set VM_LOCKED, __mlock_vma_pages_range will bring it back. */ vma->vm_flags = newflags; - /* - * Keep track of amount of locked VM. - */ - pages = (end - start) >> PAGE_SHIFT; - if (newflags & VM_LOCKED) { - pages = -pages; - if (!(newflags & VM_IO)) - ret = make_pages_present(start, end); - } + __mlock_vma_pages_range(vma, start, end, lock); - mm->locked_vm -= pages; out: if (ret == -ENOMEM) ret = -EAGAIN; Index: linux-2.6.24-rc4-mm1/mm/vmscan.c =================================================================== --- linux-2.6.24-rc4-mm1.orig/mm/vmscan.c +++ linux-2.6.24-rc4-mm1/mm/vmscan.c @@ -2238,10 +2238,11 @@ int zone_reclaim(struct zone *zone, gfp_ * * @page - page to test * @vma - vm area in which page is/will be mapped. May be NULL. - * If !NULL, called from fault path. + * If !NULL, called from fault path for a new page. * * Reasons page might not be reclaimable: - * + page's mapping marked non-reclaimable + * 1) page's mapping marked non-reclaimable + * 2) page is mlock'ed into memory. * TODO - later patches * * TODO: specify locking assumptions @@ -2254,6 +2255,11 @@ int page_reclaimable(struct page *page, if (mapping_non_reclaimable(page_mapping(page))) return 0; +#ifdef CONFIG_NORECLAIM_MLOCK + if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page))) + return 0; +#endif + /* TODO: test page [!]reclaimable conditions */ return 1; Index: linux-2.6.24-rc4-mm1/include/linux/page-flags.h =================================================================== --- linux-2.6.24-rc4-mm1.orig/include/linux/page-flags.h +++ linux-2.6.24-rc4-mm1/include/linux/page-flags.h @@ -110,6 +110,7 @@ #define PG_uncached 31 /* Page has been mapped as uncached */ #define PG_noreclaim 30 /* Page is "non-reclaimable" */ +#define PG_mlocked 29 /* Page is vma mlocked */ #endif /* @@ -163,6 +164,7 @@ static inline void SetPageUptodate(struc #define SetPageActive(page) set_bit(PG_active, &(page)->flags) #define ClearPageActive(page) clear_bit(PG_active, &(page)->flags) #define __ClearPageActive(page) __clear_bit(PG_active, &(page)->flags) +#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags) #define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags) #define PageSlab(page) test_bit(PG_slab, &(page)->flags) @@ -270,8 +272,17 @@ static inline void __ClearPageTail(struc #define SetPageNoreclaim(page) set_bit(PG_noreclaim, &(page)->flags) #define ClearPageNoreclaim(page) clear_bit(PG_noreclaim, &(page)->flags) #define __ClearPageNoreclaim(page) __clear_bit(PG_noreclaim, &(page)->flags) -#define TestClearPageNoreclaim(page) test_and_clear_bit(PG_noreclaim, \ - &(page)->flags) +#define TestClearPageNoreclaim(page) \ + test_and_clear_bit(PG_noreclaim, &(page)->flags) +#ifdef CONFIG_NORECLAIM_MLOCK +#define PageMlocked(page) test_bit(PG_mlocked, &(page)->flags) +#define SetPageMlocked(page) set_bit(PG_mlocked, &(page)->flags) +#define ClearPageMlocked(page) clear_bit(PG_mlocked, &(page)->flags) +#define __ClearPageMlocked(page) __clear_bit(PG_mlocked, &(page)->flags) +#define TestSetPageMlocked(page) test_and_set_bit(PG_mlocked, &(page)->flags) +#define TestClearPageMlocked(page) \ + test_and_clear_bit(PG_mlocked, &(page)->flags) +#endif #else #define PageNoreclaim(page) 0 #define SetPageNoreclaim(page) @@ -279,6 +290,14 @@ static inline void __ClearPageTail(struc #define __ClearPageNoreclaim(page) #define TestClearPageNoreclaim(page) 0 #endif +#ifndef CONFIG_NORECLAIM_MLOCK +#define PageMlocked(page) 0 +#define SetPageMlocked(page) +#define ClearPageMlocked(page) +#define __ClearPageMlocked(page) +#define TestSetPageMlocked(page) 0 +#define TestClearPageMlocked(page) 0 +#endif #define PageUncached(page) test_bit(PG_uncached, &(page)->flags) #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags) Index: linux-2.6.24-rc4-mm1/include/linux/rmap.h =================================================================== --- linux-2.6.24-rc4-mm1.orig/include/linux/rmap.h +++ linux-2.6.24-rc4-mm1/include/linux/rmap.h @@ -112,6 +112,17 @@ unsigned long page_address_in_vma(struct */ int page_mkclean(struct page *); +#ifdef CONFIG_NORECLAIM_MLOCK +/* + * called in munlock()/munmap() path to check for other vmas holding + * the page mlocked. + */ +int try_to_unlock(struct page *); +#define TRY_TO_UNLOCK 1 +#else +#define TRY_TO_UNLOCK 0 /* for compiler -- dead code elimination */ +#endif + #else /* !CONFIG_MMU */ #define anon_vma_init() do {} while (0) @@ -135,5 +146,6 @@ static inline int page_mkclean(struct pa #define SWAP_SUCCESS 0 #define SWAP_AGAIN 1 #define SWAP_FAIL 2 +#define SWAP_MLOCK 3 #endif /* _LINUX_RMAP_H */ Index: linux-2.6.24-rc4-mm1/mm/rmap.c =================================================================== --- linux-2.6.24-rc4-mm1.orig/mm/rmap.c +++ linux-2.6.24-rc4-mm1/mm/rmap.c @@ -52,6 +52,8 @@ #include +#include "internal.h" + struct kmem_cache *anon_vma_cachep; /* This must be called under the mmap_sem. */ @@ -284,10 +286,17 @@ static int page_referenced_one(struct pa if (!pte) goto out; + /* + * Don't want to elevate referenced for mlocked page that gets this far, + * in order that it progresses to try_to_unmap and is moved to the + * noreclaim list. + */ if (vma->vm_flags & VM_LOCKED) { - referenced++; *mapcount = 1; /* break early from loop */ - } else if (ptep_clear_flush_young(vma, address, pte)) + goto out_unmap; + } + + if (ptep_clear_flush_young(vma, address, pte)) referenced++; /* Pretend the page is referenced if the task has the @@ -296,6 +305,7 @@ static int page_referenced_one(struct pa rwsem_is_locked(&mm->mmap_sem)) referenced++; +out_unmap: (*mapcount)--; pte_unmap_unlock(pte, ptl); out: @@ -384,11 +394,6 @@ static int page_referenced_file(struct p */ if (mem_cont && (mm_cgroup(vma->vm_mm) != mem_cont)) continue; - if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE)) - == (VM_LOCKED|VM_MAYSHARE)) { - referenced++; - break; - } referenced += page_referenced_one(page, vma, &mapcount); if (!mapcount) break; @@ -712,10 +717,15 @@ static int try_to_unmap_one(struct page * If it's recently referenced (perhaps page_referenced * skipped over this mm) then we should reactivate it. */ - if (!migration && ((vma->vm_flags & VM_LOCKED) || - (ptep_clear_flush_young(vma, address, pte)))) { - ret = SWAP_FAIL; - goto out_unmap; + if (!migration) { + if (vma->vm_flags & VM_LOCKED) { + ret = SWAP_MLOCK; + goto out_unmap; + } + if (ptep_clear_flush_young(vma, address, pte)) { + ret = SWAP_FAIL; + goto out_unmap; + } } /* Nuke the page table entry. */ @@ -797,6 +807,10 @@ out: * For very sparsely populated VMAs this is a little inefficient - chances are * there there won't be many ptes located within the scan cluster. In this case * maybe we could scan further - to the end of the pte page, perhaps. + * +TODO: still accurate with noreclaim infrastructure? + * Mlocked pages also aren't handled very well at the moment: they aren't + * moved off the LRU like they are for linear pages. */ #define CLUSTER_SIZE min(32*PAGE_SIZE, PMD_SIZE) #define CLUSTER_MASK (~(CLUSTER_SIZE - 1)) @@ -868,10 +882,28 @@ static void try_to_unmap_cluster(unsigne pte_unmap_unlock(pte - 1, ptl); } -static int try_to_unmap_anon(struct page *page, int migration) +/** + * try_to_unmap_anon - unmap or unlock anonymous page using the object-based + * rmap method + * @page: the page to unmap/unlock + * @unlock: request for unlock rather than unmap [unlikely] + * @migration: unmapping for migration - ignored if @unlock + * + * Find all the mappings of a page using the mapping pointer and the vma chains + * contained in the anon_vma struct it points to. + * + * This function is only called from try_to_unmap/try_to_unlock for + * anonymous pages. + * When called from try_to_unlock(), the mmap_sem of the mm containing the vma + * where the page was found will be held for write. So, we won't recheck + * vm_flags for that VMA. That should be OK, because that vma shouldn't be + * 'LOCKED. + */ +static int try_to_unmap_anon(struct page *page, int unlock, int migration) { struct anon_vma *anon_vma; struct vm_area_struct *vma; + unsigned int mlocked = 0; int ret = SWAP_AGAIN; anon_vma = page_lock_anon_vma(page); @@ -879,25 +911,53 @@ static int try_to_unmap_anon(struct page return ret; list_for_each_entry(vma, &anon_vma->head, anon_vma_node) { - ret = try_to_unmap_one(page, vma, migration); + if (TRY_TO_UNLOCK && unlikely(unlock)) { + if (!(vma->vm_flags & VM_LOCKED)) + continue; /* must visit all vmas */ + mlocked++; + break; /* no need to look further */ + } else + ret = try_to_unmap_one(page, vma, migration); if (ret == SWAP_FAIL || !page_mapped(page)) break; + if (ret == SWAP_MLOCK) { + if (down_read_trylock(&vma->vm_mm->mmap_sem)) { + if (vma->vm_flags & VM_LOCKED) { + mlock_vma_page(page); + mlocked++; + } + up_read(&vma->vm_mm->mmap_sem); + } + } } - page_unlock_anon_vma(anon_vma); + + if (mlocked) + ret = SWAP_MLOCK; + else if (ret == SWAP_MLOCK) + ret = SWAP_AGAIN; + return ret; } /** - * try_to_unmap_file - unmap file page using the object-based rmap method - * @page: the page to unmap + * try_to_unmap_file - unmap or unlock file page using the object-based + * rmap method + * @page: the page to unmap/unlock + * @unlock: request for unlock rather than unmap [unlikely] + * @migration: unmapping for migration - ignored if @unlock * * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the address_space struct it points to. * - * This function is only called from try_to_unmap for object-based pages. + * This function is only called from try_to_unmap/try_to_unlock for + * object-based pages. + * When called from try_to_unlock(), the mmap_sem of the mm containing the vma + * where the page was found will be held for write. So, we won't recheck + * vm_flags for that VMA. That should be OK, because that vma shouldn't be + * 'LOCKED. */ -static int try_to_unmap_file(struct page *page, int migration) +static int try_to_unmap_file(struct page *page, int unlock, int migration) { struct address_space *mapping = page->mapping; pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); @@ -908,20 +968,47 @@ static int try_to_unmap_file(struct page unsigned long max_nl_cursor = 0; unsigned long max_nl_size = 0; unsigned int mapcount; + unsigned int mlocked = 0; read_lock(&mapping->i_mmap_lock); vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) { - ret = try_to_unmap_one(page, vma, migration); + if (TRY_TO_UNLOCK && unlikely(unlock)) { + if (!(vma->vm_flags & VM_LOCKED)) + continue; /* must visit all vmas */ + mlocked++; + break; /* no need to look further */ + } else + ret = try_to_unmap_one(page, vma, migration); if (ret == SWAP_FAIL || !page_mapped(page)) goto out; + if (ret == SWAP_MLOCK) { + if (down_read_trylock(&vma->vm_mm->mmap_sem)) { + if (vma->vm_flags & VM_LOCKED) { + mlock_vma_page(page); + mlocked++; + } + up_read(&vma->vm_mm->mmap_sem); + } + if (unlikely(unlock)) + break; /* stop on 1st mlocked vma */ + } } + if (mlocked) + goto out; + if (list_empty(&mapping->i_mmap_nonlinear)) goto out; list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) { - if ((vma->vm_flags & VM_LOCKED) && !migration) + if (TRY_TO_UNLOCK && unlikely(unlock)) { + if (!(vma->vm_flags & VM_LOCKED)) + continue; /* must visit all vmas */ + mlocked++; + goto out; /* no need to look further */ + } + if (!migration && (vma->vm_flags & VM_LOCKED)) continue; cursor = (unsigned long) vma->vm_private_data; if (cursor > max_nl_cursor) @@ -955,8 +1042,6 @@ static int try_to_unmap_file(struct page do { list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) { - if ((vma->vm_flags & VM_LOCKED) && !migration) - continue; cursor = (unsigned long) vma->vm_private_data; while ( cursor < max_nl_cursor && cursor < vma->vm_end - vma->vm_start) { @@ -981,6 +1066,10 @@ static int try_to_unmap_file(struct page vma->vm_private_data = NULL; out: read_unlock(&mapping->i_mmap_lock); + if (mlocked) + ret = SWAP_MLOCK; + else if (ret == SWAP_MLOCK) + ret = SWAP_AGAIN; return ret; } @@ -995,6 +1084,7 @@ out: * SWAP_SUCCESS - we succeeded in removing all mappings * SWAP_AGAIN - we missed a mapping, try again later * SWAP_FAIL - the page is unswappable + * SWAP_MLOCK - page is mlocked. */ int try_to_unmap(struct page *page, int migration) { @@ -1003,12 +1093,32 @@ int try_to_unmap(struct page *page, int BUG_ON(!PageLocked(page)); if (PageAnon(page)) - ret = try_to_unmap_anon(page, migration); + ret = try_to_unmap_anon(page, 0, migration); else - ret = try_to_unmap_file(page, migration); - - if (!page_mapped(page)) + ret = try_to_unmap_file(page, 0, migration); + if (ret != SWAP_MLOCK && !page_mapped(page)) ret = SWAP_SUCCESS; return ret; } +#ifdef CONFIG_NORECLAIM_MLOCK +/** + * try_to_unlock - Check page's rmap for other vma's holding page locked. + * @page: the page to be unlocked. will be returned with PG_mlocked + * cleared if no vmas are VM_LOCKED. + * + * Return values are: + * + * SWAP_SUCCESS - no vma's holding page locked. + * SWAP_MLOCK - page is mlocked. + */ +int try_to_unlock(struct page *page) +{ + VM_BUG_ON(!PageLocked(page) || PageLRU(page)); + + if (PageAnon(page)) + return(try_to_unmap_anon(page, 1, 0)); + else + return(try_to_unmap_file(page, 1, 0)); +} +#endif Index: linux-2.6.24-rc4-mm1/mm/migrate.c =================================================================== --- linux-2.6.24-rc4-mm1.orig/mm/migrate.c +++ linux-2.6.24-rc4-mm1/mm/migrate.c @@ -366,6 +366,9 @@ static void migrate_page_copy(struct pag set_page_dirty(newpage); } + if (TestClearPageMlocked(page)) + SetPageMlocked(newpage); + #ifdef CONFIG_SWAP ClearPageSwapCache(page); #endif Index: linux-2.6.24-rc4-mm1/mm/page_alloc.c =================================================================== --- linux-2.6.24-rc4-mm1.orig/mm/page_alloc.c +++ linux-2.6.24-rc4-mm1/mm/page_alloc.c @@ -255,6 +255,7 @@ static void bad_page(struct page *page) 1 << PG_swapcache | 1 << PG_writeback | 1 << PG_swapbacked | + 1 << PG_mlocked | 1 << PG_buddy ); set_page_count(page, 0); reset_page_mapcount(page); @@ -484,6 +485,9 @@ static inline int free_pages_check(struc 1 << PG_writeback | 1 << PG_reserved | 1 << PG_noreclaim | +// TODO: always trip this under heavy workloads. +// Why isn't this being cleared on last unmap/unlock? +// 1 << PG_mlocked | 1 << PG_buddy )))) bad_page(page); if (PageDirty(page)) @@ -638,6 +642,8 @@ static int prep_new_page(struct page *pa 1 << PG_writeback | 1 << PG_reserved | 1 << PG_swapbacked | +//TODO: why hitting this? +// 1 << PG_mlocked | 1 << PG_buddy )))) bad_page(page); @@ -650,7 +656,9 @@ static int prep_new_page(struct page *pa page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead | 1 << PG_referenced | 1 << PG_arch_1 | - 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk); + 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk | +//TODO take care of it here, for now. + 1 << PG_mlocked ); set_page_private(page, 0); set_page_refcounted(page); Index: linux-2.6.24-rc4-mm1/mm/swap.c =================================================================== --- linux-2.6.24-rc4-mm1.orig/mm/swap.c +++ linux-2.6.24-rc4-mm1/mm/swap.c @@ -346,7 +346,7 @@ void lru_add_drain(void) put_cpu(); } -#ifdef CONFIG_NUMA +#if defined(CONFIG_NUMA) || defined(CONFIG_NORECLAIM_MLOCK) static void lru_add_drain_per_cpu(struct work_struct *dummy) { lru_add_drain(); -- All Rights Reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/