Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756171AbaDHFgl (ORCPT ); Tue, 8 Apr 2014 01:36:41 -0400 Received: from cn.fujitsu.com ([59.151.112.132]:56494 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1750785AbaDHFgj (ORCPT ); Tue, 8 Apr 2014 01:36:39 -0400 X-IronPort-AV: E=Sophos;i="4.97,815,1389715200"; d="scan'208";a="28983033" Message-ID: <53438A9D.7040104@cn.fujitsu.com> Date: Tue, 8 Apr 2014 13:35:25 +0800 From: Zhang Yanfei User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20131030 Thunderbird/17.0.10 MIME-Version: 1.0 To: Minchan Kim CC: Andrew Morton , , , Dave Hansen , John Stultz , Hugh Dickins , Johannes Weiner , Rik van Riel , KOSAKI Motohiro , Mel Gorman , Jason Evans Subject: Re: [PATCH v3] support madvise(MADV_FREE) References: <1396839098-11936-1-git-send-email-minchan@kernel.org> In-Reply-To: <1396839098-11936-1-git-send-email-minchan@kernel.org> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.167.226.197] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Minchan, Nice work. It seems I forgot to reply your mail in the first RFC version of the patch, so I just review this new one. On 04/07/2014 10:51 AM, Minchan Kim wrote: > Linux doesn't have an ability to free pages lazy while other OS > already have been supported that named by madvise(MADV_FREE). > > The gain is clear that kernel can discard freed pages rather than > swapping out or OOM if memory pressure happens. > > Without memory pressure, freed pages would be reused by userspace > without another additional overhead(ex, page fault + allocation > + zeroing). Just from this, I cannot see how you handle the 'lazyfree' pages and how to distinguish 'lazyfree' and 'nonlazyfree' pages. Anyway, I've known these from the patch itself, so I know them, but it is difficult for someone who wants to get the knowledge from the log. So could you add a little explanation into this log to help understanding the code, especially you have put all the changes into one patch? > > Firstly, heavy users would be general allocators(ex, jemalloc, > tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already > have supported the feature for other OS(ex, FreeBSD) > > barrios@blaptop:~/benchmark/ebizzy$ lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 4 > On-line CPU(s) list: 0-3 > Thread(s) per core: 2 > Core(s) per socket: 2 > Socket(s): 1 > NUMA node(s): 1 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 42 > Stepping: 7 > CPU MHz: 2801.000 > BogoMIPS: 5581.64 > Virtualization: VT-x > L1d cache: 32K > L1i cache: 32K > L2 cache: 256K > L3 cache: 4096K > NUMA node0 CPU(s): 0-3 > > ebizzy benchmark(./ebizzy -S 10 -n 512) > > vanilla-jemalloc MADV_free-jemalloc > > 1 thread > records: 10 records: 10 > avg: 7436.70 avg: 15292.70 > std: 48.01(0.65%) std: 496.40(3.25%) > max: 7542.00 max: 15944.00 > min: 7366.00 min: 14478.00 > > 2 thread > records: 10 records: 10 > avg: 12190.50 avg: 24975.50 > std: 1011.51(8.30%) std: 1127.22(4.51%) > max: 13012.00 max: 26382.00 > min: 10192.00 min: 23265.00 > > 4 thread > records: 10 records: 10 > avg: 16875.30 avg: 36320.90 > std: 562.59(3.33%) std: 1503.75(4.14%) > max: 17465.00 max: 38314.00 > min: 15552.00 min: 33863.00 > > 8 thread > records: 10 records: 10 > avg: 16966.80 avg: 35915.20 > std: 229.35(1.35%) std: 2153.89(6.00%) > max: 17456.00 max: 37943.00 > min: 16742.00 min: 29891.00 > > 16 thread > records: 10 records: 10 > avg: 20590.90 avg: 37388.40 > std: 362.33(1.76%) std: 1282.59(3.43%) > max: 20954.00 max: 38911.00 > min: 19985.00 min: 34928.00 > > 32 thread > records: 10 records: 10 > avg: 22633.40 avg: 37118.00 > std: 413.73(1.83%) std: 766.36(2.06%) > max: 23120.00 max: 38328.00 > min: 22071.00 min: 35557.00 > > In summary, MADV_FREE is about 2 time faster than MADV_DONTNEED. > > Patchset is based on 3.14 > > * From v2 > * Remove forceful dirty marking of swap-readed page - Johannes > * Remove deactivation logic of lazyfreed page > * Rebased on 3.14 > * Remove RFC tag > > * From v1 > * Use custom page table walker for madvise_free - Johannes > * Remove PG_lazypage flag - Johannes > * Do madvise_dontneed instead of madvise_freein swapless system > > Cc: Hugh Dickins > Cc: Johannes Weiner > Cc: Rik van Riel > Cc: KOSAKI Motohiro > Cc: Mel Gorman > Cc: Jason Evans > Signed-off-by: Minchan Kim > --- > include/linux/mm.h | 2 + > include/linux/rmap.h | 14 +++- > include/linux/vm_event_item.h | 1 + > include/uapi/asm-generic/mman-common.h | 1 + > mm/madvise.c | 25 ++++++ > mm/memory.c | 141 +++++++++++++++++++++++++++++++++ > mm/rmap.c | 62 +++++++++++++-- > mm/vmscan.c | 30 ++++++- > mm/vmstat.c | 1 + > 9 files changed, 267 insertions(+), 10 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index c1b7414c7bef..79af90212c19 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1063,6 +1063,8 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, > unsigned long size); > void zap_page_range(struct vm_area_struct *vma, unsigned long address, > unsigned long size, struct zap_details *); > +void lazyfree_range(struct vm_area_struct *vma, unsigned long address, > + unsigned long size); > void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma, > unsigned long start, unsigned long end); > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index b66c2110cb1f..1f3517027ffa 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -182,7 +182,8 @@ static inline void page_dup_rmap(struct page *page) > * Called from mm/vmscan.c to handle paging out > */ > int page_referenced(struct page *, int is_locked, > - struct mem_cgroup *memcg, unsigned long *vm_flags); > + struct mem_cgroup *memcg, unsigned long *vm_flags, > + int *is_dirty); > int page_referenced_one(struct page *, struct vm_area_struct *, > unsigned long address, void *arg); > > @@ -235,6 +236,11 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page); > void page_unlock_anon_vma_read(struct anon_vma *anon_vma); > int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma); > > +struct rmap_private { > + enum ttu_flags flags; > + int pte_dirty; /* used for lazyfree */ > +}; > + > /* > * rmap_walk_control: To control rmap traversing for specific needs > * > @@ -265,9 +271,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc); > > static inline int page_referenced(struct page *page, int is_locked, > struct mem_cgroup *memcg, > - unsigned long *vm_flags) > + unsigned long *vm_flags, > + int *is_pte_dirty) > { > *vm_flags = 0; > + if (is_pte_dirty) > + *is_pte_dirty = 0; > return 0; > } > > @@ -288,5 +297,6 @@ static inline int page_mkclean(struct page *page) > #define SWAP_AGAIN 1 > #define SWAP_FAIL 2 > #define SWAP_MLOCK 3 > +#define SWAP_DISCARD 4 > > #endif /* _LINUX_RMAP_H */ > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h > index 3a712e2e7d76..a69680d335bb 100644 > --- a/include/linux/vm_event_item.h > +++ b/include/linux/vm_event_item.h > @@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, > FOR_ALL_ZONES(PGALLOC), > PGFREE, PGACTIVATE, PGDEACTIVATE, > PGFAULT, PGMAJFAULT, > + PGLAZYFREED, > FOR_ALL_ZONES(PGREFILL), > FOR_ALL_ZONES(PGSTEAL_KSWAPD), > FOR_ALL_ZONES(PGSTEAL_DIRECT), > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index 4164529a94f9..b7eedd33ebab 100644 > --- a/include/uapi/asm-generic/mman-common.h > +++ b/include/uapi/asm-generic/mman-common.h > @@ -34,6 +34,7 @@ > #define MADV_SEQUENTIAL 2 /* expect sequential page references */ > #define MADV_WILLNEED 3 /* will need these pages */ > #define MADV_DONTNEED 4 /* don't need these pages */ > +#define MADV_FREE 5 /* free pages only if memory pressure */ > > /* common parameters: try to keep these consistent across architectures */ > #define MADV_REMOVE 9 /* remove these pages & resources */ > diff --git a/mm/madvise.c b/mm/madvise.c > index 539eeb96b323..4a2aa96b2db3 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -31,6 +31,7 @@ static int madvise_need_mmap_write(int behavior) > case MADV_REMOVE: > case MADV_WILLNEED: > case MADV_DONTNEED: > + case MADV_FREE: > return 0; > default: > /* be safe, default to 1. list exceptions explicitly */ > @@ -251,6 +252,22 @@ static long madvise_willneed(struct vm_area_struct *vma, > return 0; > } > > +static long madvise_lazyfree(struct vm_area_struct *vma, > + struct vm_area_struct **prev, > + unsigned long start, unsigned long end) > +{ > + *prev = vma; > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) > + return -EINVAL; > + > + /* MADV_FREE works for only anon vma at the moment */ > + if (vma->vm_file) > + return -EINVAL; > + > + lazyfree_range(vma, start, end - start); > + return 0; > +} > + > /* > * Application no longer needs these pages. If the pages are dirty, > * it's OK to just throw them away. The app will be more careful about > @@ -384,6 +401,13 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, > return madvise_remove(vma, prev, start, end); > case MADV_WILLNEED: > return madvise_willneed(vma, prev, start, end); > + case MADV_FREE: > + /* > + * In this implementation, MADV_FREE works like MADV_DONTNEED > + * on swapless system or full swap. > + */ > + if (get_nr_swap_pages() > 0) > + return madvise_lazyfree(vma, prev, start, end); > case MADV_DONTNEED: > return madvise_dontneed(vma, prev, start, end); > default: > @@ -403,6 +427,7 @@ madvise_behavior_valid(int behavior) > case MADV_REMOVE: > case MADV_WILLNEED: > case MADV_DONTNEED: > + case MADV_FREE: > #ifdef CONFIG_KSM > case MADV_MERGEABLE: > case MADV_UNMERGEABLE: > diff --git a/mm/memory.c b/mm/memory.c > index 22dfa617bddb..7811faa8bbe3 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1268,6 +1268,105 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, > return addr; > } > > +static unsigned long lazyfree_pte_range(struct mmu_gather *tlb, > + struct vm_area_struct *vma, pmd_t *pmd, > + unsigned long addr, unsigned long end) > +{ > + struct mm_struct *mm = tlb->mm; > + spinlock_t *ptl; > + pte_t *start_pte; > + pte_t *pte; > + > + start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl); > + pte = start_pte; > + arch_enter_lazy_mmu_mode(); > + do { > + pte_t ptent = *pte; > + > + if (pte_none(ptent)) > + continue; > + > + if (!pte_present(ptent)) > + continue; > + > + ptent = pte_mkold(ptent); > + ptent = pte_mkclean(ptent); > + set_pte_at(mm, addr, pte, ptent); > + tlb_remove_tlb_entry(tlb, pte, addr); > + } while (pte++, addr += PAGE_SIZE, addr != end); > + arch_leave_lazy_mmu_mode(); > + pte_unmap_unlock(start_pte, ptl); > + > + return addr; > +} > + > +static inline unsigned long lazyfree_pmd_range(struct mmu_gather *tlb, > + struct vm_area_struct *vma, pud_t *pud, > + unsigned long addr, unsigned long end) > +{ > + pmd_t *pmd; > + unsigned long next; > + > + pmd = pmd_offset(pud, addr); > + do { > + next = pmd_addr_end(addr, end); > + /* TODO : optimize */ > + if (pmd_trans_huge(*pmd)) > + split_huge_page_pmd(vma, addr, pmd); > + /* > + * Here there can be other concurrent MADV_DONTNEED or > + * trans huge page faults running, and if the pmd is > + * none or trans huge it can change under us. This is > + * because MADV_LAZYFREE holds the mmap_sem in read > + * mode. > + */ > + if (pmd_none_or_trans_huge_or_clear_bad(pmd)) > + goto next; > + next = lazyfree_pte_range(tlb, vma, pmd, addr, next); > +next: > + cond_resched(); > + } while (pmd++, addr = next, addr != end); > + > + return addr; > +} > + > +static inline unsigned long lazyfree_pud_range(struct mmu_gather *tlb, > + struct vm_area_struct *vma, pgd_t *pgd, > + unsigned long addr, unsigned long end) > +{ > + pud_t *pud; > + unsigned long next; > + > + pud = pud_offset(pgd, addr); > + do { > + next = pud_addr_end(addr, end); > + if (pud_none_or_clear_bad(pud)) > + continue; > + next = lazyfree_pmd_range(tlb, vma, pud, addr, next); > + } while (pud++, addr = next, addr != end); > + > + return addr; > +} > + > +static void lazyfree_page_range(struct mmu_gather *tlb, > + struct vm_area_struct *vma, > + unsigned long addr, unsigned long end) > +{ > + pgd_t *pgd; > + unsigned long next; > + > + BUG_ON(addr >= end); > + tlb_start_vma(tlb, vma); > + pgd = pgd_offset(vma->vm_mm, addr); > + do { > + next = pgd_addr_end(addr, end); > + if (pgd_none_or_clear_bad(pgd)) > + continue; > + next = lazyfree_pud_range(tlb, vma, pgd, addr, next); > + } while (pgd++, addr = next, addr != end); > + tlb_end_vma(tlb, vma); > +} > + > static void unmap_page_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, > unsigned long addr, unsigned long end, > @@ -1294,6 +1393,23 @@ static void unmap_page_range(struct mmu_gather *tlb, > } > > > +static void lazyfree_single_vma(struct mmu_gather *tlb, > + struct vm_area_struct *vma, unsigned long start_addr, > + unsigned long end_addr) > +{ > + unsigned long start = max(vma->vm_start, start_addr); > + unsigned long end; > + > + if (start >= vma->vm_end) > + return; > + end = min(vma->vm_end, end_addr); > + if (end <= vma->vm_start) > + return; > + > + if (start != end) > + lazyfree_page_range(tlb, vma, start, end); > +} > + > static void unmap_single_vma(struct mmu_gather *tlb, > struct vm_area_struct *vma, unsigned long start_addr, > unsigned long end_addr, > @@ -1368,6 +1484,31 @@ void unmap_vmas(struct mmu_gather *tlb, > } > > /** > + * lazyfree_range - clear dirty bit of pte in a given range > + * @vma: vm_area_struct holding the applicable pages > + * @start: starting address of pages > + * @size: number of bytes to do lazyfree > + * > + * Caller must protect the VMA list > + */ > +void lazyfree_range(struct vm_area_struct *vma, unsigned long start, > + unsigned long size) > +{ > + struct mm_struct *mm = vma->vm_mm; > + struct mmu_gather tlb; > + unsigned long end = start + size; > + > + lru_add_drain(); > + tlb_gather_mmu(&tlb, mm, start, end); > + update_hiwater_rss(mm); > + mmu_notifier_invalidate_range_start(mm, start, end); > + for ( ; vma && vma->vm_start < end; vma = vma->vm_next) > + lazyfree_single_vma(&tlb, vma, start, end); > + mmu_notifier_invalidate_range_end(mm, start, end); > + tlb_finish_mmu(&tlb, start, end); > +} > + > +/** > * zap_page_range - remove user pages in a given range > * @vma: vm_area_struct holding the applicable pages > * @start: starting address of pages to zap > diff --git a/mm/rmap.c b/mm/rmap.c > index 8fc049f9a5a6..ca0ebd28edfa 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -661,6 +661,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma) > } > > struct page_referenced_arg { > + int dirtied; > int mapcount; > int referenced; > unsigned long vm_flags; > @@ -675,6 +676,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma, > struct mm_struct *mm = vma->vm_mm; > spinlock_t *ptl; > int referenced = 0; > + int dirty = 0; > struct page_referenced_arg *pra = arg; > > if (unlikely(PageTransHuge(page))) { > @@ -727,6 +729,10 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma, > if (likely(!(vma->vm_flags & VM_SEQ_READ))) > referenced++; > } > + > + if (pte_dirty(*pte)) > + dirty++; > + > pte_unmap_unlock(pte, ptl); > } > > @@ -735,6 +741,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma, > pra->vm_flags |= vma->vm_flags; > } > > + if (dirty) > + pra->dirtied++; > + > pra->mapcount--; > if (!pra->mapcount) > return SWAP_SUCCESS; /* To break the loop */ > @@ -759,6 +768,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg) > * @is_locked: caller holds lock on the page > * @memcg: target memory cgroup > * @vm_flags: collect encountered vma->vm_flags who actually referenced the page > + * @is_pte_dirty: ptes which have marked dirty bit > * > * Quick test_and_clear_referenced for all mappings to a page, > * returns the number of ptes which referenced the page. > @@ -766,7 +776,8 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg) > int page_referenced(struct page *page, > int is_locked, > struct mem_cgroup *memcg, > - unsigned long *vm_flags) > + unsigned long *vm_flags, > + int *is_pte_dirty) > { > int ret; > int we_locked = 0; > @@ -781,6 +792,9 @@ int page_referenced(struct page *page, > }; > > *vm_flags = 0; > + if (is_pte_dirty) > + *is_pte_dirty = 0; > + > if (!page_mapped(page)) > return 0; > > @@ -808,6 +822,9 @@ int page_referenced(struct page *page, > if (we_locked) > unlock_page(page); > > + if (is_pte_dirty) > + *is_pte_dirty = pra.dirtied; > + > return pra.referenced; > } > > @@ -1120,7 +1137,9 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > pte_t pteval; > spinlock_t *ptl; > int ret = SWAP_AGAIN; > - enum ttu_flags flags = (enum ttu_flags)arg; > + struct rmap_private *rp = (struct rmap_private *)arg; > + enum ttu_flags flags = rp->flags; > + int dirty = 0; > > pte = page_check_address(page, mm, address, &ptl, 0); > if (!pte) > @@ -1150,7 +1169,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > pteval = ptep_clear_flush(vma, address, pte); > > /* Move the dirty bit to the physical page now the pte is gone. */ > - if (pte_dirty(pteval)) > + dirty = pte_dirty(pteval); > + if (dirty) > set_page_dirty(page); > > /* Update high watermark before we lower rss */ > @@ -1169,6 +1189,15 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > swp_entry_t entry = { .val = page_private(page) }; > pte_t swp_pte; > > + if ((TTU_ACTION(flags) == TTU_UNMAP) && !dirty && > + !PageSwapCache(page) && !PageDirty(page)) { I see this !dirty && !PageSwapCache(page) && !PageDirty(page) test in several places, could be placed into a inline function with a appropriate name? > + dec_mm_counter(mm, MM_ANONPAGES); > + goto discard; > + } > + > + if (dirty) > + rp->pte_dirty++; > + > if (PageSwapCache(page)) { > /* > * Store the swap location in the pte. > @@ -1187,6 +1216,10 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > } > dec_mm_counter(mm, MM_ANONPAGES); > inc_mm_counter(mm, MM_SWAPENTS); > + } else if (TTU_ACTION(flags) == TTU_UNMAP) { > + set_pte_at(mm, address, pte, pteval); > + ret = SWAP_FAIL; > + goto out_unmap; Can you elaborate this chage a bit? > } else if (IS_ENABLED(CONFIG_MIGRATION)) { > /* > * Store the pfn of the page in a special migration > @@ -1210,6 +1243,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, > } else > dec_mm_counter(mm, MM_FILEPAGES); > > +discard: > page_remove_rmap(page); > page_cache_release(page); > > @@ -1470,13 +1504,19 @@ static int page_not_mapped(struct page *page) > * SWAP_AGAIN - we missed a mapping, try again later > * SWAP_FAIL - the page is unswappable > * SWAP_MLOCK - page is mlocked. > + * SWAP_DISCARD - same with SWAP_SUCCESS but no need to swap out > */ > int try_to_unmap(struct page *page, enum ttu_flags flags) > { > int ret; > + > + struct rmap_private rp = { > + .flags = flags, > + }; > + > struct rmap_walk_control rwc = { > .rmap_one = try_to_unmap_one, > - .arg = (void *)flags, > + .arg = &rp, > .done = page_not_mapped, > .file_nonlinear = try_to_unmap_nonlinear, > .anon_lock = page_lock_anon_vma_read, > @@ -1497,8 +1537,14 @@ int try_to_unmap(struct page *page, enum ttu_flags flags) > > ret = rmap_walk(page, &rwc); > > - if (ret != SWAP_MLOCK && !page_mapped(page)) > + if (ret != SWAP_MLOCK && !page_mapped(page)) { > ret = SWAP_SUCCESS; > + if (TTU_ACTION(flags) == TTU_UNMAP && !rp.pte_dirty && > + PageAnon(page) && !PageSwapCache(page) && > + !PageDirty(page) && !PageKsm(page)) > + ret = SWAP_DISCARD; > + } > + > return ret; > } > > @@ -1520,9 +1566,13 @@ int try_to_unmap(struct page *page, enum ttu_flags flags) > int try_to_munlock(struct page *page) > { > int ret; > + struct rmap_private rp = { > + .flags = TTU_MUNLOCK, > + }; > + > struct rmap_walk_control rwc = { > .rmap_one = try_to_unmap_one, > - .arg = (void *)TTU_MUNLOCK, > + .arg = &rp, > .done = page_not_mapped, > /* > * We don't bother to try to find the munlocked page in > diff --git a/mm/vmscan.c b/mm/vmscan.c > index a9c74b409681..7f1c5a26bc41 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -684,6 +684,7 @@ enum page_references { > PAGEREF_RECLAIM_CLEAN, > PAGEREF_KEEP, > PAGEREF_ACTIVATE, > + PAGEREF_DISCARD, > }; > > static enum page_references page_check_references(struct page *page, > @@ -691,9 +692,12 @@ static enum page_references page_check_references(struct page *page, > { > int referenced_ptes, referenced_page; > unsigned long vm_flags; > + int is_pte_dirty; > + > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup, > - &vm_flags); > + &vm_flags, &is_pte_dirty); > referenced_page = TestClearPageReferenced(page); > > /* > @@ -734,6 +738,18 @@ static enum page_references page_check_references(struct page *page, > return PAGEREF_KEEP; > } > > + /* > + * We should check PageDirty because swap-in page by read fault > + * will be swapcache and pte point out the page doesn't have > + * dirty bit so only pte dirtiness check isn't enough. In this case, > + * it would be good to check PG_swapcache to filter it out. > + * If the page is removed from swapcache, it must have PG_dirty > + * so we should check it to prevent purging non-lazyfree page. > + */ Nice explanation. So this is the key point to know how to detect a lazyfree page, I think something like this can be put in the patch log, too. Thanks Zhang > + if (PageAnon(page) && !is_pte_dirty && > + !PageSwapCache(page) && !PageDirty(page)) > + return PAGEREF_DISCARD; > + > /* Reclaim if clean, defer dirty pages to writeback */ > if (referenced_page && !PageSwapBacked(page)) > return PAGEREF_RECLAIM_CLEAN; > @@ -932,6 +948,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, > goto activate_locked; > case PAGEREF_KEEP: > goto keep_locked; > + case PAGEREF_DISCARD: > + goto discard; > case PAGEREF_RECLAIM: > case PAGEREF_RECLAIM_CLEAN: > ; /* try to reclaim the page below */ > @@ -957,6 +975,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, > * processes. Try to unmap it here. > */ > if (page_mapped(page) && mapping) { > +discard: > switch (try_to_unmap(page, ttu_flags)) { > case SWAP_FAIL: > goto activate_locked; > @@ -964,6 +983,13 @@ static unsigned long shrink_page_list(struct list_head *page_list, > goto keep_locked; > case SWAP_MLOCK: > goto cull_mlocked; > + case SWAP_DISCARD: > + VM_BUG_ON_PAGE(PageSwapCache(page), page); > + if (!page_freeze_refs(page, 1)) > + goto keep_locked; > + __clear_page_locked(page); > + count_vm_event(PGLAZYFREED); > + goto free_it; > case SWAP_SUCCESS: > ; /* try to free the page below */ > } > @@ -1688,7 +1714,7 @@ static void shrink_active_list(unsigned long nr_to_scan, > } > > if (page_referenced(page, 0, sc->target_mem_cgroup, > - &vm_flags)) { > + &vm_flags, NULL)) { > nr_rotated += hpage_nr_pages(page); > /* > * Identify referenced, file-backed active pages and > diff --git a/mm/vmstat.c b/mm/vmstat.c > index def5dd2fbe61..2d80f7ed495d 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -789,6 +789,7 @@ const char * const vmstat_text[] = { > > "pgfault", > "pgmajfault", > + "pglazyfreed", > > TEXTS_FOR_ZONES("pgrefill") > TEXTS_FOR_ZONES("pgsteal_kswapd") > -- Thanks. Zhang Yanfei -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/