Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756293AbZFJGlc (ORCPT ); Wed, 10 Jun 2009 02:41:32 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752608AbZFJGlZ (ORCPT ); Wed, 10 Jun 2009 02:41:25 -0400 Received: from fgwmail6.fujitsu.co.jp ([192.51.44.36]:46282 "EHLO fgwmail6.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750783AbZFJGlY (ORCPT ); Wed, 10 Jun 2009 02:41:24 -0400 Date: Wed, 10 Jun 2009 15:39:52 +0900 From: KAMEZAWA Hiroyuki To: Johannes Weiner Cc: Andrew Morton , Rik van Riel , Hugh Dickins , Andi Kleen , Wu Fengguang , Minchan Kim , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [patch v3] swap: virtual swap readahead Message-Id: <20090610153952.925eeb97.kamezawa.hiroyu@jp.fujitsu.com> In-Reply-To: <20090609193702.GA2017@cmpxchg.org> References: <20090609190128.GA1785@cmpxchg.org> <20090609193702.GA2017@cmpxchg.org> Organization: FUJITSU Co. LTD. X-Mailer: Sylpheed 2.5.0 (GTK+ 2.10.14; i686-pc-mingw32) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11758 Lines: 310 On Tue, 9 Jun 2009 21:37:02 +0200 Johannes Weiner wrote: > On Tue, Jun 09, 2009 at 09:01:28PM +0200, Johannes Weiner wrote: > > [resend with lists cc'd, sorry] > > [and fixed Hugh's email. crap] > > > Hi, > > > > here is a new iteration of the virtual swap readahead. Per Hugh's > > suggestion, I moved the pte collecting to the callsite and thus out > > ouf swap code. Unfortunately, I had to bound page_cluster due to an > > array of that many swap entries on the stack, but I think it is better > > to limit the cluster size to a sane maximum than using dynamic > > allocation for this purpose. > > > > Thanks all for the helpful suggestions. KAMEZAWA-san and Minchan, I > > didn't incorporate your ideas in this patch as I think they belong in > > a different one with their own justifications. I didn't ignore them. > > > > Hannes > > > > --- > > The current swap readahead implementation reads a physically > > contiguous group of swap slots around the faulting page to take > > advantage of the disk head's position and in the hope that the > > surrounding pages will be needed soon as well. > > > > This works as long as the physical swap slot order approximates the > > LRU order decently, otherwise it wastes memory and IO bandwidth to > > read in pages that are unlikely to be needed soon. > > > > However, the physical swap slot layout diverges from the LRU order > > with increasing swap activity, i.e. high memory pressure situations, > > and this is exactly the situation where swapin should not waste any > > memory or IO bandwidth as both are the most contended resources at > > this point. > > > > Another approximation for LRU-relation is the VMA order as groups of > > VMA-related pages are usually used together. > > > > This patch combines both the physical and the virtual hint to get a > > good approximation of pages that are sensible to read ahead. > > > > When both diverge, we either read unrelated data, seek heavily for > > related data, or, what this patch does, just decrease the readahead > > efforts. > > > > To achieve this, we have essentially two readahead windows of the same > > size: one spans the virtual, the other one the physical neighborhood > > of the faulting page. We only read where both areas overlap. > > > > Signed-off-by: Johannes Weiner > > Reviewed-by: Rik van Riel > > Cc: Hugh Dickins > > Cc: Andi Kleen > > Cc: Wu Fengguang > > Cc: KAMEZAWA Hiroyuki > > Cc: Minchan Kim > > --- > > include/linux/swap.h | 4 ++- > > kernel/sysctl.c | 7 ++++- > > mm/memory.c | 55 +++++++++++++++++++++++++++++++++++++++++ > > mm/shmem.c | 4 +-- > > mm/swap_state.c | 67 ++++++++++++++++++++++++++++++++++++++------------- > > 5 files changed, 116 insertions(+), 21 deletions(-) > > > > version 3: > > o move pte selection to callee (per Hugh) > > o limit ra ptes to one pmd entry to avoid multiple > > locking/mapping of highptes (per Hugh) > > > > version 2: > > o fall back to physical ra window for shmem > > o add documentation to the new ra algorithm (per Andrew) > > > > --- a/mm/swap_state.c > > +++ b/mm/swap_state.c > > @@ -327,27 +327,14 @@ struct page *read_swap_cache_async(swp_e > > return found_page; > > } > > > > -/** > > - * swapin_readahead - swap in pages in hope we need them soon > > - * @entry: swap entry of this memory > > - * @gfp_mask: memory allocation flags > > - * @vma: user vma this address belongs to > > - * @addr: target address for mempolicy > > - * > > - * Returns the struct page for entry and addr, after queueing swapin. > > - * > > +/* > > * Primitive swap readahead code. We simply read an aligned block of > > * (1 << page_cluster) entries in the swap area. This method is chosen > > * because it doesn't cost us any seek time. We also make sure to queue > > * the 'original' request together with the readahead ones... > > - * > > - * This has been extended to use the NUMA policies from the mm triggering > > - * the readahead. > > - * > > - * Caller must hold down_read on the vma->vm_mm if vma is not NULL. > > */ > > -struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, > > - struct vm_area_struct *vma, unsigned long addr) > > +static struct page *swapin_readahead_phys(swp_entry_t entry, gfp_t gfp_mask, > > + struct vm_area_struct *vma, unsigned long addr) > > { > > int nr_pages; > > struct page *page; > > @@ -373,3 +360,51 @@ struct page *swapin_readahead(swp_entry_ > > lru_add_drain(); /* Push any new pages onto the LRU now */ > > return read_swap_cache_async(entry, gfp_mask, vma, addr); > > } > > + > > +/** > > + * swapin_readahead - swap in pages in hope we need them soon > > + * @entry: swap entry of this memory > > + * @gfp_mask: memory allocation flags > > + * @vma: user vma this address belongs to > > + * @addr: target address for mempolicy > > + * @entries: swap slots to consider reading > > + * @nr_entries: number of @entries > > + * @cluster: readahead window size in swap slots > > + * > > + * Returns the struct page for entry and addr, after queueing swapin. > > + * > > + * This has been extended to use the NUMA policies from the mm > > + * triggering the readahead. > > + * > > + * Caller must hold down_read on the vma->vm_mm if vma is not NULL. > > + */ > > +struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask, > > + struct vm_area_struct *vma, unsigned long addr, > > + swp_entry_t *entries, int nr_entries, > > + unsigned long cluster) > > +{ > > + unsigned long pmin, pmax; > > + int i; > > + > > + if (!entries) /* XXX: shmem case */ > > + return swapin_readahead_phys(entry, gfp_mask, vma, addr); > > + pmin = swp_offset(entry) & ~(cluster - 1); > > + pmax = pmin + cluster; > > + for (i = 0; i < nr_entries; i++) { > > + swp_entry_t swp = entries[i]; > > + struct page *page; > > + > > + if (swp_type(swp) != swp_type(entry)) > > + continue; > > + if (swp_offset(swp) > pmax) > > + continue; > > + if (swp_offset(swp) < pmin) > > + continue; > > + page = read_swap_cache_async(swp, gfp_mask, vma, addr); > > + if (!page) > > + break; > > + page_cache_release(page); > > + } > > + lru_add_drain(); /* Push any new pages onto the LRU now */ > > + return read_swap_cache_async(entry, gfp_mask, vma, addr); > > +} > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -292,7 +292,9 @@ extern struct page *lookup_swap_cache(sw > > extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, > > struct vm_area_struct *vma, unsigned long addr); > > extern struct page *swapin_readahead(swp_entry_t, gfp_t, > > - struct vm_area_struct *vma, unsigned long addr); > > + struct vm_area_struct *vma, unsigned long addr, > > + swp_entry_t *entries, int nr_entries, > > + unsigned long cluster); > > > > /* linux/mm/swapfile.c */ > > extern long nr_swap_pages; > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -2440,6 +2440,54 @@ int vmtruncate_range(struct inode *inode > > } > > > > /* > > + * The readahead window is the virtual area around the faulting page, > > + * where the physical proximity of the swap slots is taken into > > + * account as well in swapin_readahead(). > > + * > > + * While the swap allocation algorithm tries to keep LRU-related pages > > + * together on the swap backing, it is not reliable on heavy thrashing > > + * systems where concurrent reclaimers allocate swap slots and/or most > > + * anonymous memory pages are already in swap cache. > > + * > > + * On the virtual side, subgroups of VMA-related pages are usually > > + * used together, which gives another hint to LRU relationship. > > + * > > + * By taking both aspects into account, we get a good approximation of > > + * which pages are sensible to read together with the faulting one. > > + */ > > +static int swap_readahead_ptes(struct mm_struct *mm, > > + unsigned long addr, pmd_t *pmd, > > + swp_entry_t *entries, > > + unsigned long cluster) > > +{ > > + unsigned long window, min, max, limit; > > + spinlock_t *ptl; > > + pte_t *ptep; > > + int i, nr; > > + > > + window = cluster << PAGE_SHIFT; > > + min = addr & ~(window - 1); > > + max = min + cluster; Hmm, max = min + window ? Thanks, -Kame > > + /* > > + * To keep the locking/highpte mapping simple, stay > > + * within the PTE range of one PMD entry. > > + */ > > + limit = addr & PMD_MASK; > > + if (limit > min) > > + min = limit; > > + limit = pmd_addr_end(addr, max); > > + if (limit < max) > > + max = limit; > > + limit = max - min; > > + ptep = pte_offset_map_lock(mm, pmd, min, &ptl); > > + for (i = nr = 0; i < limit; i++) > > + if (is_swap_pte(ptep[i])) > > + entries[nr++] = pte_to_swp_entry(ptep[i]); > > + pte_unmap_unlock(ptep, ptl); > > + return nr; > > +} > > + > > +/* > > * We enter with non-exclusive mmap_sem (to exclude vma changes, > > * but allow concurrent faults), and pte mapped but not yet locked. > > * We return with mmap_sem still held, but pte unmapped and unlocked. > > @@ -2466,9 +2514,14 @@ static int do_swap_page(struct mm_struct > > delayacct_set_flag(DELAYACCT_PF_SWAPIN); > > page = lookup_swap_cache(entry); > > if (!page) { > > + int nr, cluster = 1 << page_cluster; > > + swp_entry_t entries[cluster]; > > + > > grab_swap_token(); /* Contend for token _before_ read-in */ > > + nr = swap_readahead_ptes(mm, address, pmd, entries, cluster); > > page = swapin_readahead(entry, > > - GFP_HIGHUSER_MOVABLE, vma, address); > > + GFP_HIGHUSER_MOVABLE, vma, address, > > + entries, nr, cluster); > > if (!page) { > > /* > > * Back out if somebody else faulted in this pte > > --- a/mm/shmem.c > > +++ b/mm/shmem.c > > @@ -1148,7 +1148,7 @@ static struct page *shmem_swapin(swp_ent > > pvma.vm_pgoff = idx; > > pvma.vm_ops = NULL; > > pvma.vm_policy = spol; > > - page = swapin_readahead(entry, gfp, &pvma, 0); > > + page = swapin_readahead(entry, gfp, &pvma, 0, NULL, 0, 0); > > return page; > > } > > > > @@ -1178,7 +1178,7 @@ static inline void shmem_show_mpol(struc > > static inline struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp, > > struct shmem_inode_info *info, unsigned long idx) > > { > > - return swapin_readahead(entry, gfp, NULL, 0); > > + return swapin_readahead(entry, gfp, NULL, 0, NULL, 0, 0); > > } > > > > static inline struct page *shmem_alloc_page(gfp_t gfp, > > --- a/kernel/sysctl.c > > +++ b/kernel/sysctl.c > > @@ -112,6 +112,8 @@ static int min_percpu_pagelist_fract = 8 > > > > static int ngroups_max = NGROUPS_MAX; > > > > +static int page_cluster_max = 5; > > + > > #ifdef CONFIG_MODULES > > extern char modprobe_path[]; > > #endif > > @@ -966,7 +968,10 @@ static struct ctl_table vm_table[] = { > > .data = &page_cluster, > > .maxlen = sizeof(int), > > .mode = 0644, > > - .proc_handler = &proc_dointvec, > > + .proc_handler = &proc_dointvec_minmax, > > + .strategy = &sysctl_intvec, > > + .extra1 = &zero, > > + .extra2 = &page_cluster_max, > > }, > > { > > .ctl_name = VM_DIRTY_BACKGROUND, > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: email@kvack.org > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/