Larry Woodman reported that the current VM has a big performance regression on
the AIM7 benchmark. AIM7 uses a very large number of processes and does
fork/exit/page-fault frequently, which causes serious lock contention on the
pte lock and the anon_vma lock.

In 2.6.28 we removed calc_reclaim_mapped(), so vmscan now always calls
page_referenced() even when VM pressure is low. Unfortunately, that increased
the lock contention further.
Larry, can you please try this patch series on your big box?
The page reclaim logic needs to distinguish mapped from unmapped pages.
However, page_mapping_inuse() doesn't provide the proper test: it checks
whether the address space (i.e. the file) is mmap'd anywhere. Why should
reclaiming this page care about whether unrelated pages of the same file are
mapped? It's unrelated.

Thus, this patch replaces page_mapping_inuse() with page_mapped().
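For illustration only (this fragment is not part of the diff below, and
treat_as_in_use() is just a placeholder), the two helpers consult different
state for the same page:

	/* old: is the page's file mmap'd by anyone, anywhere? */
	if (page_mapping_inuse(page))
		treat_as_in_use();

	/* new: is this particular page in somebody's page tables? */
	if (page_mapped(page))
		treat_as_in_use();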
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 25 ++-----------------------
1 files changed, 2 insertions(+), 23 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bbee91..3366bec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -262,27 +262,6 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
return ret;
}
-/* Called without lock on whether page is mapped, so answer is unstable */
-static inline int page_mapping_inuse(struct page *page)
-{
- struct address_space *mapping;
-
- /* Page is in somebody's page tables. */
- if (page_mapped(page))
- return 1;
-
- /* Be more reluctant to reclaim swapcache than pagecache */
- if (PageSwapCache(page))
- return 1;
-
- mapping = page_mapping(page);
- if (!mapping)
- return 0;
-
- /* File is mmap'd by somebody? */
- return mapping_mapped(mapping);
-}
-
static inline int is_page_cache_freeable(struct page *page)
{
/*
@@ -649,7 +628,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* try_to_unmap moves it to unevictable list
*/
if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
- referenced && page_mapping_inuse(page)
+ referenced && page_mapped(page)
&& !(vm_flags & VM_LOCKED))
goto activate_locked;
@@ -1351,7 +1330,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
}
/* page_referenced clears PageReferenced */
- if (page_mapping_inuse(page) &&
+ if (page_mapped(page) &&
page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
nr_rotated++;
/*
--
1.6.5.2
page_check_address() needs to take the pte lock, but that lock might be
contended. So we need a trylock version; this patch introduces a new helper
function for it. A later patch will use it.
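The intended calling convention looks like this (a sketch; a later patch in
this series uses exactly this pattern in wipe_page_reference_one()):

	pte = __page_check_address(page, mm, address, &ptl, 0, noblock);
	if (IS_ERR(pte)) {
		if (PTR_ERR(pte) == -EAGAIN)
			return SWAP_AGAIN;	/* pte lock contended, retry later */
		return SWAP_SUCCESS;		/* no valid pte here, nothing to do */
	}
	/* pte is mapped and the pte lock is held */
	pte_unmap_unlock(pte, ptl);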
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/rmap.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 49 insertions(+), 13 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 278cd27..1b50425 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -268,44 +268,80 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
* the page table lock when the pte is not present (helpful when reclaiming
* highly shared pages).
*
- * On success returns with pte mapped and locked.
+ * if @noblock is true, page_check_address may return -EAGAIN if lock is
+ * contended.
+ *
+ * Returns valid pte pointer if success.
+ * Returns -EFAULT if address seems invalid.
+ * Returns -EAGAIN if trylock failed.
*/
-pte_t *page_check_address(struct page *page, struct mm_struct *mm,
- unsigned long address, spinlock_t **ptlp, int sync)
+static pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp,
+ int sync, int noblock)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
spinlock_t *ptl;
+ int err = -EFAULT;
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
- return NULL;
+ goto out;
pud = pud_offset(pgd, address);
if (!pud_present(*pud))
- return NULL;
+ goto out;
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
- return NULL;
+ goto out;
pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
- if (!sync && !pte_present(*pte)) {
- pte_unmap(pte);
- return NULL;
- }
+ if (!sync && !pte_present(*pte))
+ goto out_unmap;
ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
+ if (noblock) {
+ if (!spin_trylock(ptl)) {
+ err = -EAGAIN;
+ goto out_unmap;
+ }
+ } else
+ spin_lock(ptl);
+
if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
return pte;
}
- pte_unmap_unlock(pte, ptl);
- return NULL;
+
+ spin_unlock(ptl);
+ out_unmap:
+ pte_unmap(pte);
+ out:
+ return ERR_PTR(err);
+}
+
+/*
+ * Check that @page is mapped at @address into @mm.
+ *
+ * If @sync is false, page_check_address may perform a racy check to avoid
+ * the page table lock when the pte is not present (helpful when reclaiming
+ * highly shared pages).
+ *
+ * On success returns with pte mapped and locked.
+ */
+pte_t *page_check_address(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp, int sync)
+{
+ pte_t *pte;
+
+ pte = __page_check_address(page, mm, address, ptlp, sync, 0);
+ if (IS_ERR(pte))
+ return NULL;
+ return pte;
}
/**
--
1.6.5.2
Currently, page_referenced_one() checks VM_LOCKED after taking the pte lock,
but that is unnecessary. We can check VM_LOCKED before taking the lock.
This patch does that.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/rmap.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 1b50425..fb0983a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -383,21 +383,21 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
spinlock_t *ptl;
int referenced = 0;
- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
- goto out;
-
/*
* Don't want to elevate referenced for mlocked page that gets this far,
* in order that it progresses to try_to_unmap and is moved to the
* unevictable list.
*/
if (vma->vm_flags & VM_LOCKED) {
- *mapcount = 1; /* break early from loop */
+ *mapcount = 0; /* break early from loop */
*vm_flags |= VM_LOCKED;
- goto out_unmap;
+ goto out;
}
+ pte = page_check_address(page, mm, address, &ptl, 0);
+ if (!pte)
+ goto out;
+
if (ptep_clear_flush_young_notify(vma, address, pte)) {
/*
* Don't treat a reference through a sequentially read
@@ -416,7 +416,6 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
rwsem_is_locked(&mm->mmap_sem))
referenced++;
-out_unmap:
(*mapcount)--;
pte_unmap_unlock(pte, ptl);
--
1.6.5.2
The name page_referenced() implies "test whether the page was referenced",
but shrink_active_list() uses it to clear the pte young bits, so it should be
renamed.

Also, the vm_flags argument is really ugly. Instead, introduce a new
struct page_reference_context for collecting some statistics.

This patch doesn't change any behavior.
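Call sites then read roughly like this (a sketch based on the mm/vmscan.c
hunks below):

	struct page_reference_context refctx = {
		.is_page_locked = 1,	/* shrink_page_list() runs with the page locked */
	};

	wipe_page_reference(page, sc->mem_cgroup, &refctx);

	if (refctx.referenced) {
		/* at least one young bit (or PG_referenced) was found and cleared */
	}
	if (refctx.exec_referenced && page_is_file_cache(page)) {
		/* the reference came through a VM_EXEC mapping */
	}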
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
include/linux/ksm.h | 11 +++--
include/linux/rmap.h | 16 +++++--
mm/ksm.c | 17 +++++---
mm/rmap.c | 112 ++++++++++++++++++++++++-------------------------
mm/vmscan.c | 46 ++++++++++++--------
5 files changed, 112 insertions(+), 90 deletions(-)
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index bed5f16..e1a60d3 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -85,8 +85,8 @@ static inline struct page *ksm_might_need_to_copy(struct page *page,
return ksm_does_need_to_copy(page, vma, address);
}
-int page_referenced_ksm(struct page *page,
- struct mem_cgroup *memcg, unsigned long *vm_flags);
+int wipe_page_reference_ksm(struct page *page, struct mem_cgroup *memcg,
+ struct page_reference_context *refctx);
int try_to_unmap_ksm(struct page *page, enum ttu_flags flags);
int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
struct vm_area_struct *, unsigned long, void *), void *arg);
@@ -120,10 +120,11 @@ static inline struct page *ksm_might_need_to_copy(struct page *page,
return page;
}
-static inline int page_referenced_ksm(struct page *page,
- struct mem_cgroup *memcg, unsigned long *vm_flags)
+static inline int wipe_page_reference_ksm(struct page *page,
+ struct mem_cgroup *memcg,
+ struct page_reference_context *refctx)
{
- return 0;
+ return SWAP_SUCCESS;
}
static inline int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b019ae6..564d981 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -108,13 +108,21 @@ static inline void page_dup_rmap(struct page *page)
atomic_inc(&page->_mapcount);
}
+struct page_reference_context {
+ int is_page_locked;
+ unsigned long referenced;
+ unsigned long exec_referenced;
+ int maybe_mlocked; /* found VM_LOCKED, but it's unstable result */
+};
+
/*
* Called from mm/vmscan.c to handle paging out
*/
-int page_referenced(struct page *, int is_locked,
- struct mem_cgroup *cnt, unsigned long *vm_flags);
-int page_referenced_one(struct page *, struct vm_area_struct *,
- unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
+int wipe_page_reference(struct page *page, struct mem_cgroup *memcg,
+ struct page_reference_context *refctx);
+int wipe_page_reference_one(struct page *page,
+ struct page_reference_context *refctx,
+ struct vm_area_struct *vma, unsigned long address);
enum ttu_flags {
TTU_UNMAP = 0, /* unmap mode */
diff --git a/mm/ksm.c b/mm/ksm.c
index 56a0da1..19559ae 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1544,15 +1544,15 @@ struct page *ksm_does_need_to_copy(struct page *page,
return new_page;
}
-int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+int wipe_page_reference_ksm(struct page *page, struct mem_cgroup *memcg,
+ struct page_reference_context *refctx)
{
struct stable_node *stable_node;
struct rmap_item *rmap_item;
struct hlist_node *hlist;
unsigned int mapcount = page_mapcount(page);
- int referenced = 0;
int search_new_forks = 0;
+ int ret = SWAP_SUCCESS;
VM_BUG_ON(!PageKsm(page));
VM_BUG_ON(!PageLocked(page));
@@ -1582,10 +1582,15 @@ again:
if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
- referenced += page_referenced_one(page, vma,
- rmap_item->address, &mapcount, vm_flags);
+ ret = wipe_page_reference_one(page, refctx, vma,
+ rmap_item->address);
+ if (ret != SWAP_SUCCESS)
+ goto out;
+ mapcount--;
if (!search_new_forks || !mapcount)
break;
+ if (refctx->maybe_mlocked)
+ goto out;
}
spin_unlock(&anon_vma->lock);
if (!mapcount)
@@ -1594,7 +1599,7 @@ again:
if (!search_new_forks++)
goto again;
out:
- return referenced;
+ return ret;
}
int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
diff --git a/mm/rmap.c b/mm/rmap.c
index fb0983a..2f4451b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -371,17 +371,16 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
}
/*
- * Subfunctions of page_referenced: page_referenced_one called
- * repeatedly from either page_referenced_anon or page_referenced_file.
+ * Subfunctions of wipe_page_reference: wipe_page_reference_one called
+ * repeatedly from either wipe_page_reference_anon or wipe_page_reference_file.
*/
-int page_referenced_one(struct page *page, struct vm_area_struct *vma,
- unsigned long address, unsigned int *mapcount,
- unsigned long *vm_flags)
+int wipe_page_reference_one(struct page *page,
+ struct page_reference_context *refctx,
+ struct vm_area_struct *vma, unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte;
spinlock_t *ptl;
- int referenced = 0;
/*
* Don't want to elevate referenced for mlocked page that gets this far,
@@ -389,8 +388,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
* unevictable list.
*/
if (vma->vm_flags & VM_LOCKED) {
- *mapcount = 0; /* break early from loop */
- *vm_flags |= VM_LOCKED;
+ refctx->maybe_mlocked = 1;
goto out;
}
@@ -406,37 +404,38 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
* mapping is already gone, the unmap path will have
* set PG_referenced or activated the page.
*/
- if (likely(!VM_SequentialReadHint(vma)))
- referenced++;
+ if (likely(!VM_SequentialReadHint(vma))) {
+ refctx->referenced++;
+ if (vma->vm_flags & VM_EXEC) {
+ refctx->exec_referenced++;
+ }
+ }
}
/* Pretend the page is referenced if the task has the
swap token and is in the middle of a page fault. */
if (mm != current->mm && has_swap_token(mm) &&
rwsem_is_locked(&mm->mmap_sem))
- referenced++;
+ refctx->referenced++;
- (*mapcount)--;
pte_unmap_unlock(pte, ptl);
- if (referenced)
- *vm_flags |= vma->vm_flags;
out:
- return referenced;
+ return SWAP_SUCCESS;
}
-static int page_referenced_anon(struct page *page,
- struct mem_cgroup *mem_cont,
- unsigned long *vm_flags)
+static int wipe_page_reference_anon(struct page *page,
+ struct mem_cgroup *memcg,
+ struct page_reference_context *refctx)
{
unsigned int mapcount;
struct anon_vma *anon_vma;
struct vm_area_struct *vma;
- int referenced = 0;
+ int ret = SWAP_SUCCESS;
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
- return referenced;
+ return ret;
mapcount = page_mapcount(page);
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
@@ -448,20 +447,22 @@ static int page_referenced_anon(struct page *page,
* counting on behalf of references from different
* cgroups
*/
- if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
+ if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
- referenced += page_referenced_one(page, vma, address,
- &mapcount, vm_flags);
- if (!mapcount)
+ ret = wipe_page_reference_one(page, refctx, vma, address);
+ if (ret != SWAP_SUCCESS)
+ break;
+ mapcount--;
+ if (!mapcount || refctx->maybe_mlocked)
break;
}
page_unlock_anon_vma(anon_vma);
- return referenced;
+ return ret;
}
/**
- * page_referenced_file - referenced check for object-based rmap
+ * wipe_page_reference_file - wipe page reference for object-based rmap
* @page: the page we're checking references on.
* @mem_cont: target memory controller
* @vm_flags: collect encountered vma->vm_flags who actually referenced the page
@@ -471,18 +472,18 @@ static int page_referenced_anon(struct page *page,
* pointer, then walking the chain of vmas it holds. It returns the number
* of references it found.
*
- * This function is only called from page_referenced for object-based pages.
+ * This function is only called from wipe_page_reference for object-based pages.
*/
-static int page_referenced_file(struct page *page,
- struct mem_cgroup *mem_cont,
- unsigned long *vm_flags)
+static int wipe_page_reference_file(struct page *page,
+ struct mem_cgroup *memcg,
+ struct page_reference_context *refctx)
{
unsigned int mapcount;
struct address_space *mapping = page->mapping;
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
struct vm_area_struct *vma;
struct prio_tree_iter iter;
- int referenced = 0;
+ int ret = SWAP_SUCCESS;
/*
* The caller's checks on page->mapping and !PageAnon have made
@@ -516,65 +517,62 @@ static int page_referenced_file(struct page *page,
* counting on behalf of references from different
* cgroups
*/
- if (mem_cont && !mm_match_cgroup(vma->vm_mm, mem_cont))
+ if (memcg && !mm_match_cgroup(vma->vm_mm, memcg))
continue;
- referenced += page_referenced_one(page, vma, address,
- &mapcount, vm_flags);
- if (!mapcount)
+ ret = wipe_page_reference_one(page, refctx, vma, address);
+ if (ret != SWAP_SUCCESS)
+ break;
+ mapcount--;
+ if (!mapcount || refctx->maybe_mlocked)
break;
}
spin_unlock(&mapping->i_mmap_lock);
- return referenced;
+ return ret;
}
/**
- * page_referenced - test if the page was referenced
+ * wipe_page_reference - clear and test the page reference and pte young bit
* @page: the page to test
- * @is_locked: caller holds lock on the page
- * @mem_cont: target memory controller
- * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @memcg: target memory controller
+ * @refctx: context for collect some statistics
*
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
*/
-int page_referenced(struct page *page,
- int is_locked,
- struct mem_cgroup *mem_cont,
- unsigned long *vm_flags)
+int wipe_page_reference(struct page *page,
+ struct mem_cgroup *memcg,
+ struct page_reference_context *refctx)
{
- int referenced = 0;
int we_locked = 0;
+ int ret = SWAP_SUCCESS;
if (TestClearPageReferenced(page))
- referenced++;
+ refctx->referenced++;
- *vm_flags = 0;
if (page_mapped(page) && page_rmapping(page)) {
- if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
+ if (!refctx->is_page_locked &&
+ (!PageAnon(page) || PageKsm(page))) {
we_locked = trylock_page(page);
if (!we_locked) {
- referenced++;
+ refctx->referenced++;
goto out;
}
}
if (unlikely(PageKsm(page)))
- referenced += page_referenced_ksm(page, mem_cont,
- vm_flags);
+ ret = wipe_page_reference_ksm(page, memcg, refctx);
else if (PageAnon(page))
- referenced += page_referenced_anon(page, mem_cont,
- vm_flags);
+ ret = wipe_page_reference_anon(page, memcg, refctx);
else if (page->mapping)
- referenced += page_referenced_file(page, mem_cont,
- vm_flags);
+ ret = wipe_page_reference_file(page, memcg, refctx);
if (we_locked)
unlock_page(page);
}
out:
if (page_test_and_clear_young(page))
- referenced++;
+ refctx->referenced++;
- return referenced;
+ return ret;
}
static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3366bec..c59baa9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -569,7 +569,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct pagevec freed_pvec;
int pgactivate = 0;
unsigned long nr_reclaimed = 0;
- unsigned long vm_flags;
cond_resched();
@@ -578,7 +577,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct address_space *mapping;
struct page *page;
int may_enter_fs;
- int referenced;
+ struct page_reference_context refctx = {
+ .is_page_locked = 1,
+ };
cond_resched();
@@ -620,16 +621,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
}
- referenced = page_referenced(page, 1,
- sc->mem_cgroup, &vm_flags);
+ wipe_page_reference(page, sc->mem_cgroup, &refctx);
/*
* In active use or really unfreeable? Activate it.
* If page which have PG_mlocked lost isoltation race,
* try_to_unmap moves it to unevictable list
*/
if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
- referenced && page_mapped(page)
- && !(vm_flags & VM_LOCKED))
+ page_mapped(page) && refctx.referenced &&
+ !refctx.maybe_mlocked)
goto activate_locked;
/*
@@ -664,7 +664,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
if (PageDirty(page)) {
- if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
+ if (sc->order <= PAGE_ALLOC_COSTLY_ORDER &&
+ refctx.referenced)
goto keep_locked;
if (!may_enter_fs)
goto keep_locked;
@@ -1241,9 +1242,10 @@ static inline void note_zone_scanning_priority(struct zone *zone, int priority)
*
* If the pages are mostly unmapped, the processing is fast and it is
* appropriate to hold zone->lru_lock across the whole operation. But if
- * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone->lru_lock around each page. It's impossible to balance
- * this, so instead we remove the pages from the LRU while processing them.
+ * the pages are mapped, the processing is slow (because wipe_page_reference()
+ * walk each ptes) so we should drop zone->lru_lock around each page. It's
+ * impossible to balance this, so instead we remove the pages from the LRU
+ * while processing them.
* It is safe to rely on PG_active against the non-LRU pages in here because
* nobody will play with that bit on a non-LRU page.
*
@@ -1289,7 +1291,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
{
unsigned long nr_taken;
unsigned long pgscanned;
- unsigned long vm_flags;
LIST_HEAD(l_hold); /* The pages which were snipped off */
LIST_HEAD(l_active);
LIST_HEAD(l_inactive);
@@ -1320,6 +1321,10 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);
while (!list_empty(&l_hold)) {
+ struct page_reference_context refctx = {
+ .is_page_locked = 0,
+ };
+
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
@@ -1329,10 +1334,17 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
continue;
}
- /* page_referenced clears PageReferenced */
- if (page_mapped(page) &&
- page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
+ if (!page_mapped(page)) {
+ ClearPageActive(page); /* we are de-activating */
+ list_add(&page->lru, &l_inactive);
+ continue;
+ }
+
+ wipe_page_reference(page, sc->mem_cgroup, &refctx);
+
+ if (refctx.referenced)
nr_rotated++;
+ if (refctx.exec_referenced && page_is_file_cache(page)) {
/*
* Identify referenced, file-backed active pages and
* give them one more trip around the active list. So
@@ -1342,10 +1354,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
* IO, plus JVM can create lots of anon VM_EXEC pages,
* so we ignore them here.
*/
- if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
- list_add(&page->lru, &l_active);
- continue;
- }
+ list_add(&page->lru, &l_active);
+ continue;
}
ClearPageActive(page); /* we are de-activating */
--
1.6.5.2
Currently, wipe_page_reference() increments refctx->referenced when
trylock_page() fails, but that is meaningless: shrink_active_list() then
deactivates the page even though it may actually be referenced. A page whose
pte young bits have not been cleared shouldn't be deactivated; that breaks
basic reclaim theory and decreases reclaim throughput.

This patch introduces a new SWAP_AGAIN return value for
wipe_page_reference().
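Callers can then treat the new return code as "temporarily busy" rather than
"referenced" (a sketch of the shrink_page_list() hunk below):

	ret = wipe_page_reference(page, sc->mem_cgroup, &refctx);
	if (ret == SWAP_AGAIN) {
		/* the page lock was contended: keep the page as-is and
		 * retry on a later pass instead of faking a reference */
		goto keep_locked;
	}
	VM_BUG_ON(ret != SWAP_SUCCESS);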
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/rmap.c | 5 ++++-
mm/vmscan.c | 15 +++++++++++++--
2 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 2f4451b..b84f350 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -539,6 +539,9 @@ static int wipe_page_reference_file(struct page *page,
*
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
+ *
+ * SWAP_SUCCESS - success to wipe all ptes
+ * SWAP_AGAIN - temporary busy, try again later
*/
int wipe_page_reference(struct page *page,
struct mem_cgroup *memcg,
@@ -555,7 +558,7 @@ int wipe_page_reference(struct page *page,
(!PageAnon(page) || PageKsm(page))) {
we_locked = trylock_page(page);
if (!we_locked) {
- refctx->referenced++;
+ ret = SWAP_AGAIN;
goto out;
}
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c59baa9..a01cf5e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -577,6 +577,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct address_space *mapping;
struct page *page;
int may_enter_fs;
+ int ret;
struct page_reference_context refctx = {
.is_page_locked = 1,
};
@@ -621,7 +622,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
}
- wipe_page_reference(page, sc->mem_cgroup, &refctx);
+ ret = wipe_page_reference(page, sc->mem_cgroup, &refctx);
+ if (ret == SWAP_AGAIN)
+ goto keep_locked;
+ VM_BUG_ON(ret != SWAP_SUCCESS);
+
/*
* In active use or really unfreeable? Activate it.
* If page which have PG_mlocked lost isoltation race,
@@ -1321,6 +1326,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);
while (!list_empty(&l_hold)) {
+ int ret;
struct page_reference_context refctx = {
.is_page_locked = 0,
};
@@ -1340,7 +1346,12 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
continue;
}
- wipe_page_reference(page, sc->mem_cgroup, &refctx);
+ ret = wipe_page_reference(page, sc->mem_cgroup, &refctx);
+ if (ret == SWAP_AGAIN) {
+ list_add(&page->lru, &l_active);
+ continue;
+ }
+ VM_BUG_ON(ret != SWAP_SUCCESS);
if (refctx.referenced)
nr_rotated++;
--
1.6.5.2
Larry Woodman reported that AIM7 causes serious pte lock and anon_vma lock
contention on the current VM: the split-LRU VM (since 2.6.28) removed the
calc_reclaim_mapped() test, so shrink_active_list() always calls
page_referenced() against mapped pages even when VM pressure is low.

Light VM pressure is a very common situation, and it easily creates pte lock
contention with page faults. The anon_vma lock is then held for a long time,
which creates further lock contention, and fork/exit throughput drops a lot.

While running workloads that do lots of process forking, process exiting and
page reclamation (AIM7) on large systems, very high system time (100%) and
lots of lock contention were observed.
CPU5:
[<ffffffff814afb48>] ? _spin_lock+0x27/0x48
[<ffffffff81101deb>] ? anon_vma_link+0x2a/0x5a
[<ffffffff8105d3d8>] ? dup_mm+0x242/0x40c
[<ffffffff8105e0a9>] ? copy_process+0xab1/0x12be
[<ffffffff8105ea07>] ? do_fork+0x151/0x330
[<ffffffff81058407>] ? default_wake_function+0x0/0x36
[<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
[<ffffffff810121d3>] ? stub_clone+0x13/0x20
[<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
CPU4:
[<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
[<ffffffff81103062>] ? anon_vma_unlink+0x2a/0x84
[<ffffffff810fbab7>] ? free_pgtables+0x3c/0xe1
[<ffffffff810fd8b1>] ? exit_mmap+0xc5/0x110
[<ffffffff8105ce4c>] ? mmput+0x55/0xd9
[<ffffffff81061afd>] ? exit_mm+0x109/0x129
[<ffffffff81063846>] ? do_exit+0x1d7/0x712
[<ffffffff814b0243>] ? _spin_lock_irqsave+0x2f/0x68
[<ffffffff81063e07>] ? do_group_exit+0x86/0xb2
[<ffffffff81063e55>] ? sys_exit_group+0x22/0x3e
[<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
CPU0:
[<ffffffff814afb4a>] ? _spin_lock+0x29/0x48
[<ffffffff81101ad1>] ? page_check_address+0x9e/0x16f
[<ffffffff81101cb8>] ? page_referenced_one+0x53/0x10b
[<ffffffff81102f5a>] ? page_referenced+0xcd/0x167
[<ffffffff810eb32d>] ? shrink_active_list+0x1ed/0x2a3
[<ffffffff810ebde9>] ? shrink_zone+0xa06/0xa38
[<ffffffff8108440a>] ? getnstimeofday+0x64/0xce
[<ffffffff810ecaf9>] ? do_try_to_free_pages+0x1e5/0x362
[<ffffffff810ecd9f>] ? try_to_free_pages+0x7a/0x94
[<ffffffff810ea66f>] ? isolate_pages_global+0x0/0x242
[<ffffffff810e57b9>] ? __alloc_pages_nodemask+0x397/0x572
[<ffffffff810e3c1e>] ? __get_free_pages+0x19/0x6e
[<ffffffff8105d6c9>] ? copy_process+0xd1/0x12be
[<ffffffff81204eb2>] ? avc_has_perm+0x5c/0x84
[<ffffffff81130db8>] ? user_path_at+0x65/0xa3
[<ffffffff8105ea07>] ? do_fork+0x151/0x330
[<ffffffff810b7935>] ? check_for_new_grace_period+0x78/0xab
[<ffffffff810121d3>] ? stub_clone+0x13/0x20
[<ffffffff81011e02>] ? system_call_fastpath+0x16/0x1b
------------------------------------------------------------------------------
PerfTop: 864 irqs/sec kernel:99.7% [100000 cycles], (all, 8 CPUs)
------------------------------------------------------------------------------
samples pcnt RIP kernel function
______ _______ _____ ________________ _______________
3235.00 - 75.1% - ffffffff814afb21 : _spin_lock
670.00 - 15.6% - ffffffff81101a33 : page_check_address
165.00 - 3.8% - ffffffffa01cbc39 : rpc_sleep_on [sunrpc]
40.00 - 0.9% - ffffffff81102113 : try_to_unmap_one
29.00 - 0.7% - ffffffff81101c65 : page_referenced_one
27.00 - 0.6% - ffffffff81101964 : vma_address
8.00 - 0.2% - ffffffff8125a5a0 : clear_page_c
6.00 - 0.1% - ffffffff8125a5f0 : copy_page_c
6.00 - 0.1% - ffffffff811023ca : try_to_unmap_anon
5.00 - 0.1% - ffffffff810fb014 : copy_page_range
5.00 - 0.1% - ffffffff810e4d18 : get_page_from_freelist
So, use trylock to avoid pte lock contention when VM pressure is low.
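The soft_try flag only changes the locking primitive used for the pte lock;
the resulting behaviour is roughly (a sketch of the hunks below):

	/* in wipe_page_reference_one() */
	pte = __page_check_address(page, mm, address, &ptl, 0,
				   refctx->soft_try);
	if (IS_ERR(pte)) {
		if (PTR_ERR(pte) == -EAGAIN)
			ret = SWAP_AGAIN;	/* pte lock contended, give up politely */
		goto out;
	}

	/* in shrink_active_list(): only give up when VM pressure is low */
	.soft_try = (priority < DEF_PRIORITY - 2) ? 0 : 1,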
Reported-by: Larry Woodman <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Rik van Riel <[email protected]>
---
include/linux/rmap.h | 4 ++++
mm/rmap.c | 16 ++++++++++++----
mm/vmscan.c | 1 +
3 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 564d981..499972e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -110,6 +110,10 @@ static inline void page_dup_rmap(struct page *page)
struct page_reference_context {
int is_page_locked;
+
+ /* if 1, we might give up to wipe when find lock contention. */
+ int soft_try;
+
unsigned long referenced;
unsigned long exec_referenced;
int maybe_mlocked; /* found VM_LOCKED, but it's unstable result */
diff --git a/mm/rmap.c b/mm/rmap.c
index b84f350..5ae7c81 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -373,6 +373,9 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
/*
* Subfunctions of wipe_page_reference: wipe_page_reference_one called
* repeatedly from either wipe_page_reference_anon or wipe_page_reference_file.
+ *
+ * SWAP_SUCCESS - success
+ * SWAP_AGAIN - give up to take lock, try later again
*/
int wipe_page_reference_one(struct page *page,
struct page_reference_context *refctx,
@@ -381,6 +384,7 @@ int wipe_page_reference_one(struct page *page,
struct mm_struct *mm = vma->vm_mm;
pte_t *pte;
spinlock_t *ptl;
+ int ret = SWAP_SUCCESS;
/*
* Don't want to elevate referenced for mlocked page that gets this far,
@@ -392,10 +396,14 @@ int wipe_page_reference_one(struct page *page,
goto out;
}
- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
+ pte = __page_check_address(page, mm, address, &ptl, 0,
+ refctx->soft_try);
+ if (IS_ERR(pte)) {
+ if (PTR_ERR(pte) == -EAGAIN) {
+ ret = SWAP_AGAIN;
+ }
goto out;
-
+ }
if (ptep_clear_flush_young_notify(vma, address, pte)) {
/*
* Don't treat a reference through a sequentially read
@@ -421,7 +429,7 @@ int wipe_page_reference_one(struct page *page,
pte_unmap_unlock(pte, ptl);
out:
- return SWAP_SUCCESS;
+ return ret;
}
static int wipe_page_reference_anon(struct page *page,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a01cf5e..c235059 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1329,6 +1329,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
int ret;
struct page_reference_context refctx = {
.is_page_locked = 0,
+ .soft_try = (priority < DEF_PRIORITY - 2) ? 0 : 1,
};
cond_resched();
--
1.6.5.2
Both try_to_unmap() and wipe_page_reference() walk a page's ptes, but the
latter doesn't mark the page PG_mlocked when it finds a VM_LOCKED vma.
This patch adds that.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/rmap.c | 14 ++++++++++++++
mm/vmscan.c | 2 ++
2 files changed, 16 insertions(+), 0 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 5ae7c81..cfda0a0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -376,6 +376,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
*
* SWAP_SUCCESS - success
* SWAP_AGAIN - give up to take lock, try later again
+ * SWAP_MLOCK - the page is mlocked
*/
int wipe_page_reference_one(struct page *page,
struct page_reference_context *refctx,
@@ -401,6 +402,7 @@ int wipe_page_reference_one(struct page *page,
if (IS_ERR(pte)) {
if (PTR_ERR(pte) == -EAGAIN) {
ret = SWAP_AGAIN;
+ goto out_mlock;
}
goto out;
}
@@ -430,6 +432,17 @@ int wipe_page_reference_one(struct page *page,
out:
return ret;
+
+out_mlock:
+ if (refctx->is_page_locked &&
+ down_read_trylock(&vma->vm_mm->mmap_sem)) {
+ if (vma->vm_flags & VM_LOCKED) {
+ mlock_vma_page(page);
+ ret = SWAP_MLOCK;
+ }
+ up_read(&vma->vm_mm->mmap_sem);
+ }
+ return ret;
}
static int wipe_page_reference_anon(struct page *page,
@@ -550,6 +563,7 @@ static int wipe_page_reference_file(struct page *page,
*
* SWAP_SUCCESS - success to wipe all ptes
* SWAP_AGAIN - temporary busy, try again later
+ * SWAP_MLOCK - the page is mlocked
*/
int wipe_page_reference(struct page *page,
struct mem_cgroup *memcg,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c235059..4738a12 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -625,6 +625,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
ret = wipe_page_reference(page, sc->mem_cgroup, &refctx);
if (ret == SWAP_AGAIN)
goto keep_locked;
+ else if (ret == SWAP_MLOCK)
+ goto cull_mlocked;
VM_BUG_ON(ret != SWAP_SUCCESS);
/*
--
1.6.5.2
Changelog
o from v1
  - Fix comments.
  - Rename too_many_young_bit_found() to too_many_referenced()
    [as Rik suggested].
o from Andrea's original patch
  - Rebase on top of my patches.
  - Use a list_cut_position()/list_splice_tail() pair instead of
    list_del()/list_add() to make the pte scan fair.
  - Only apply the max young threshold when soft_try is true.
    This avoids a wrong OOM side effect.
  - Return SWAP_AGAIN instead of a successful result when the max
    young threshold is exceeded. This prevents pages whose pte young
    bits were not cleared from being wrongly deactivated.
  - Add handling of KSM pages.

Heavily shared and frequently used pages don't need to be deactivated and run
through try_to_unmap(). While VM pressure is low that is pointless: the page
would likely be reactivated soon, so it only wastes CPU.

So, this patch stops the pte scan when wipe_page_reference() has found many
young pte bits.
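In the rmap walk this becomes a simple early exit (a sketch of the mm/rmap.c
hunk below; the anon case additionally rotates the anon_vma list):

	ret = wipe_page_reference_one(page, refctx, vma, address);
	if (ret != SWAP_SUCCESS)
		break;
	if (too_many_referenced(refctx)) {
		/* enough young bits found: stop scanning and report the
		 * page as busy rather than making it a reclaim candidate */
		ret = SWAP_AGAIN;
		break;
	}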
Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
include/linux/rmap.h | 18 ++++++++++++++++++
mm/ksm.c | 4 ++++
mm/rmap.c | 19 +++++++++++++++++++
3 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 499972e..ddf2578 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -128,6 +128,24 @@ int wipe_page_reference_one(struct page *page,
struct page_reference_context *refctx,
struct vm_area_struct *vma, unsigned long address);
+#define MAX_YOUNG_BIT_CLEARED 64
+/*
+ * If VM pressure is low and the page has lots of active users, we only
+ * clear up to MAX_YOUNG_BIT_CLEARED accessed bits at a time. Clearing
+ * accessed bits takes CPU time, needs TLB invalidate IPIs and could
+ * cause lock contention. Since a heavily shared page is very likely
+ * to be used again soon, the cost outweighs the benefit of making such
+ * a heavily shared page a candidate for eviction.
+ */
+static inline
+int too_many_referenced(struct page_reference_context *refctx)
+{
+ if (refctx->soft_try &&
+ refctx->referenced >= MAX_YOUNG_BIT_CLEARED)
+ return 1;
+ return 0;
+}
+
enum ttu_flags {
TTU_UNMAP = 0, /* unmap mode */
TTU_MIGRATION = 1, /* migration mode */
diff --git a/mm/ksm.c b/mm/ksm.c
index 19559ae..e959c41 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1586,6 +1586,10 @@ again:
rmap_item->address);
if (ret != SWAP_SUCCESS)
goto out;
+ if (too_many_referenced(refctx)) {
+ ret = SWAP_AGAIN;
+ goto out;
+ }
mapcount--;
if (!search_new_forks || !mapcount)
break;
diff --git a/mm/rmap.c b/mm/rmap.c
index cfda0a0..d66b8dc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -473,6 +473,21 @@ static int wipe_page_reference_anon(struct page *page,
ret = wipe_page_reference_one(page, refctx, vma, address);
if (ret != SWAP_SUCCESS)
break;
+ if (too_many_referenced(refctx)) {
+ LIST_HEAD(tmp_list);
+
+ /*
+ * Rotating the anon vmas around help spread out lock
+ * pressure in the VM. It help to reduce heavy lock
+ * contention.
+ */
+ list_cut_position(&tmp_list,
+ &vma->anon_vma_node,
+ &anon_vma->head);
+ list_splice_tail(&tmp_list, &vma->anon_vma_node);
+ ret = SWAP_AGAIN;
+ break;
+ }
mapcount--;
if (!mapcount || refctx->maybe_mlocked)
break;
@@ -543,6 +558,10 @@ static int wipe_page_reference_file(struct page *page,
ret = wipe_page_reference_one(page, refctx, vma, address);
if (ret != SWAP_SUCCESS)
break;
+ if (too_many_referenced(refctx)) {
+ ret = SWAP_AGAIN;
+ break;
+ }
mapcount--;
if (!mapcount || refctx->maybe_mlocked)
break;
--
1.6.5.2
> Larry Woodman reported that the current VM has a big performance regression on
> the AIM7 benchmark. AIM7 uses a very large number of processes and does
> fork/exit/page-fault frequently, which causes serious lock contention on the
> pte lock and the anon_vma lock.
>
> In 2.6.28 we removed calc_reclaim_mapped(), so vmscan now always calls
> page_referenced() even when VM pressure is low. Unfortunately, that increased
> the lock contention further.
>
>
> Larry, can you please try this patch series on your big box?
This patch series is made on top of mmotm1208.
KOSAKI Motohiro wrote:
> @@ -578,7 +577,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>
> + struct page_reference_context refctx = {
> + .is_page_locked = 1,
>
> *
> @@ -1289,7 +1291,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
>
> + struct page_reference_context refctx = {
> + .is_page_locked = 0,
> + };
> +
>
are these whole structs properly initialized on the kernel stack?
> KOSAKI Motohiro wrote:
> > @@ -578,7 +577,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >
> > + struct page_reference_context refctx = {
> > + .is_page_locked = 1,
> >
> > *
> > @@ -1289,7 +1291,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
> >
> > + struct page_reference_context refctx = {
> > + .is_page_locked = 0,
> > + };
> > +
> >
> are these whole structs properly initialized on the kernel stack?
Yes. The C spec says:
3.5.7 Initialization
Syntax
initializer:
assignment-expression
{ initializer-list }
{ initializer-list , }
initializer-list:
initializer
initializer-list , initializer
(snip)
If there are fewer initializers in a list than there are members of
an aggregate, the remainder of the aggregate shall be initialized
implicitly the same as objects that have static storage duration.
Referenced to
Draft ANSI C Standard (ANSI X3J11/88-090) (May 13, 1988) http://flash-gordon.me.uk/ansi.c.txt
Googling "{0}" will probably help you understand initializers in C.
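A minimal standalone example (using the struct from this series) shows the
effect:

	struct page_reference_context {
		int is_page_locked;
		unsigned long referenced;
		unsigned long exec_referenced;
		int maybe_mlocked;
	};

	struct page_reference_context refctx = {
		.is_page_locked = 1,
	};
	/*
	 * refctx.referenced, refctx.exec_referenced and refctx.maybe_mlocked
	 * are all guaranteed to be 0 here: the members without an explicit
	 * initializer are initialized as if they had static storage duration.
	 */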
Hi, Kosaki.
On Thu, Dec 10, 2009 at 4:35 PM, KOSAKI Motohiro
<[email protected]> wrote:
> Changelog
> o from v1
>   - Fix comments.
>   - Rename too_many_young_bit_found() to too_many_referenced()
>     [as Rik suggested].
> o from Andrea's original patch
>   - Rebase on top of my patches.
>   - Use a list_cut_position()/list_splice_tail() pair instead of
>     list_del()/list_add() to make the pte scan fair.
>   - Only apply the max young threshold when soft_try is true.
>     This avoids a wrong OOM side effect.
>   - Return SWAP_AGAIN instead of a successful result when the max
>     young threshold is exceeded. This prevents pages whose pte young
>     bits were not cleared from being wrongly deactivated.
>   - Add handling of KSM pages.
>
> Heavily shared and frequently used pages don't need to be deactivated and run
> through try_to_unmap(). While VM pressure is low that is pointless: the page
> would likely be reactivated soon, so it only wastes CPU.
>
> So, this patch stops the pte scan when wipe_page_reference() has found many
> young pte bits.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> ---
> include/linux/rmap.h | 18 ++++++++++++++++++
> mm/ksm.c | 4 ++++
> mm/rmap.c | 19 +++++++++++++++++++
> 3 files changed, 41 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 499972e..ddf2578 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -128,6 +128,24 @@ int wipe_page_reference_one(struct page *page,
> struct page_reference_context *refctx,
> struct vm_area_struct *vma, unsigned long address);
>
> +#define MAX_YOUNG_BIT_CLEARED 64
> +/*
This idea is good for embedded systems which don't have a hardware access bit.
Such systems emulate the access bit with minor page faults, AFAIK: when the VM
clears the young bit, the kernel marks the page table entry as no-access (or
something similar) so it refaults, and on the next touch of that address the
kernel can set the young bit again. That is a rather costly operation compared
to a hardware access bit, so this idea is good for embedded systems.

But 64 is rather big, and many embedded systems don't have many processes.
So I would like to scale this number according to memory size, like
inactive_ratio for example.
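(Just to illustrate the suggestion, not a tested proposal: the helper name and
the scaling factors below are made up.)

	/*
	 * Hypothetical: grow the young-bit budget with system memory,
	 * in the same spirit as the inactive_ratio heuristic.
	 */
	static int max_young_bit_cleared(void)
	{
		unsigned long gb = totalram_pages >> (30 - PAGE_SHIFT);

		/* e.g. 8 cleared bits per GB, clamped to the 16..64 range */
		return clamp_t(int, gb * 8, 16, 64);
	}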
Thanks for good idea and effort. :)
--
Kind regards,
Minchan Kim