Date: Wed, 3 Jun 2009 00:37:39 +0200
From: Johannes Weiner
To: Andrew Morton
Cc: Rik van Riel, Peter Zijlstra, Hugh Dickins, Andi Kleen, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [patch][v2] swap: virtual swap readahead
Message-ID: <20090602223738.GA15475@cmpxchg.org>

Hi Andrew,

I redid the qsbench runs with a bigger page cluster (2^4). It shows
improvement on both versions, with the patched one still performing
better. Rik hinted that we can make the default even bigger once we
are better at avoiding reading unrelated pages. I am currently
testing this.

Here are the timings for 2^4 (i.e. twice the) ra pages:

vanilla:
 1 x 2048M [20 runs]  user 101.41/101.06 [1.42]  system 11.02/10.83 [0.92]  real 368.44/361.31 [48.47]
 2 x 1024M [20 runs]  user 101.42/101.23 [0.66]  system 12.98/13.01 [0.56]  real 338.45/338.56 [2.94]
 4 x  540M [20 runs]  user 101.75/101.62 [1.03]  system 10.05/9.52  [1.53]  real 371.97/351.88 [77.69]
 8 x  280M [20 runs]  user 103.35/103.33 [0.63]  system  9.80/9.59  [1.72]  real 453.48/473.21 [115.61]
16 x  128M [20 runs]  user  91.04/91.00 [0.86]  system  8.95/9.41  [2.06]  real 312.16/342.29 [100.53]

vswapra:
 1 x 2048M [20 runs]  user  98.47/98.32 [1.33]  system  9.85/9.90  [0.92]  real 373.95/382.64 [26.77]
 2 x 1024M [20 runs]  user  96.89/97.00 [0.44]  system  9.52/9.48  [1.49]  real 288.43/281.55 [53.12]
 4 x  540M [20 runs]  user  98.74/98.70 [0.92]  system  7.62/7.83  [1.25]  real 291.15/296.94 [54.85]
 8 x  280M [20 runs]  user 100.68/100.59 [0.53]  system  7.59/7.62  [0.41]  real 305.12/311.29 [26.09]
16 x  128M [20 runs]  user  88.67/88.50 [1.02]  system  6.06/6.22  [0.72]  real 205.29/221.65 [42.06]

Furthermore, I changed the patch to leave shmem alone for now and
added documentation for the new approach. I also adjusted the
changelog a bit.

Andi, I think the NUMA policy is already taken care of. Can you have
another look at it? Other than that you gave positive feedback - can
I add your acked-by?

	Hannes

---
The current swap readahead implementation reads a physically
contiguous group of swap slots around the faulting page, to take
advantage of the disk head's position and in the hope that the
surrounding pages will be needed soon as well.

This works as long as the physical swap slot order approximates the
LRU order decently; otherwise it wastes memory and IO bandwidth
reading in pages that are unlikely to be needed soon.

However, the physical swap slot layout diverges from the LRU order
with increasing swap activity, i.e.
high memory pressure situations, and this is exactly the situation
where swapin should not waste any memory or IO bandwidth, as both are
the most contended resources at that point.

Another approximation of the LRU order is the VMA order, as groups of
VMA-related pages are usually used together.

This patch combines both the physical and the virtual hint to get a
good approximation of pages that are sensible to read ahead. When the
two orders diverge, we either read unrelated data, seek heavily for
related data, or, as this patch does, simply decrease the readahead
effort.

To achieve this, we have essentially two readahead windows of the
same size: one spans the virtual, the other one the physical
neighborhood of the faulting page. We only read where both areas
overlap.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Cc: Hugh Dickins
Cc: Andi Kleen
---
 mm/swap_state.c |  115 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 99 insertions(+), 16 deletions(-)

version 2:
  o fall back to physical ra window for shmem
  o add documentation to the new ra algorithm

qsbench, 20 runs, 1.7GB RAM, 2GB swap, "mean (standard deviation) median":

                vanilla                  vswapra
 1 x 2048M      391.25 ( 71.76) 384.56   445.55 ( 83.19) 415.41
 2 x 1024M      384.25 ( 75.00) 423.08   290.26 ( 31.38) 299.51
 4 x  540M      553.91 (100.02) 554.57   336.58 ( 52.49) 331.52
 8 x  280M      561.08 ( 82.36) 583.12   319.13 ( 43.17) 307.69
16 x  128M      285.51 (113.20) 236.62   214.24 ( 62.37) 214.15

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -325,27 +325,14 @@ struct page *read_swap_cache_async(swp_e
 	return found_page;
 }
 
-/**
- * swapin_readahead - swap in pages in hope we need them soon
- * @entry: swap entry of this memory
- * @gfp_mask: memory allocation flags
- * @vma: user vma this address belongs to
- * @addr: target address for mempolicy
- *
- * Returns the struct page for entry and addr, after queueing swapin.
- *
+/*
  * Primitive swap readahead code.
 * We simply read an aligned block of
  * (1 << page_cluster) entries in the swap area. This method is chosen
  * because it doesn't cost us any seek time. We also make sure to queue
  * the 'original' request together with the readahead ones...
- *
- * This has been extended to use the NUMA policies from the mm triggering
- * the readahead.
- *
- * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
  */
-struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+static struct page *swapin_readahead_phys(swp_entry_t entry, gfp_t gfp_mask,
+					  struct vm_area_struct *vma, unsigned long addr)
 {
 	int nr_pages;
 	struct page *page;
@@ -371,3 +358,99 @@ struct page *swapin_readahead(swp_entry_
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }
+
+/**
+ * swapin_readahead - swap in pages in hope we need them soon
+ * @entry: swap entry of this memory
+ * @gfp_mask: memory allocation flags
+ * @vma: user vma this address belongs to
+ * @addr: target address for mempolicy
+ *
+ * Returns the struct page for entry and addr, after queueing swapin.
+ *
+ * The readahead window is the virtual area around the faulting page,
+ * where the physical proximity of the swap slots is taken into
+ * account as well.
+ *
+ * While the swap allocation algorithm tries to keep LRU-related pages
+ * together on the swap backing, it is not reliable on heavy thrashing
+ * systems where concurrent reclaimers allocate swap slots and/or most
+ * anonymous memory pages are already in swap cache.
+ *
+ * On the virtual side, subgroups of VMA-related pages are usually
+ * used together, which gives another hint to LRU relationship.
+ *
+ * By taking both aspects into account, we get a good approximation of
+ * which pages are sensible to read together with the faulting one.
+ *
+ * This has been extended to use the NUMA policies from the mm
+ * triggering the readahead.
+ *
+ * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
+ */
+struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
+			struct vm_area_struct *vma, unsigned long addr)
+{
+	unsigned long start, pos, end;
+	unsigned long pmin, pmax;
+	int cluster, window;
+
+	if (!vma || !vma->vm_mm)	/* XXX: shmem case */
+		return swapin_readahead_phys(entry, gfp_mask, vma, addr);
+
+	cluster = 1 << page_cluster;
+	window = cluster << PAGE_SHIFT;
+
+	/* Physical range to read from */
+	pmin = swp_offset(entry) & ~(cluster - 1);
+	pmax = pmin + cluster;
+
+	/* Virtual range to read from */
+	start = addr & ~(window - 1);
+	end = start + window;
+
+	for (pos = start; pos < end; pos += PAGE_SIZE) {
+		struct page *page;
+		swp_entry_t swp;
+		spinlock_t *ptl;
+		pgd_t *pgd;
+		pud_t *pud;
+		pmd_t *pmd;
+		pte_t *pte;
+
+		pgd = pgd_offset(vma->vm_mm, pos);
+		if (!pgd_present(*pgd))
+			continue;
+		pud = pud_offset(pgd, pos);
+		if (!pud_present(*pud))
+			continue;
+		pmd = pmd_offset(pud, pos);
+		if (!pmd_present(*pmd))
+			continue;
+		pte = pte_offset_map_lock(vma->vm_mm, pmd, pos, &ptl);
+		if (!is_swap_pte(*pte)) {
+			pte_unmap_unlock(pte, ptl);
+			continue;
+		}
+		swp = pte_to_swp_entry(*pte);
+		pte_unmap_unlock(pte, ptl);
+
+		if (swp_type(swp) != swp_type(entry))
+			continue;
+		/*
+		 * Don't move the disk head too far away. This also
+		 * throttles readahead while thrashing, where virtual
+		 * order diverges more and more from physical order.
+		 */
+		if (swp_offset(swp) > pmax)
+			continue;
+		if (swp_offset(swp) < pmin)
+			continue;
+		page = read_swap_cache_async(swp, gfp_mask, vma, pos);
+		if (!page)
+			continue;
+		page_cache_release(page);
+	}
+	lru_add_drain();	/* Push any new pages onto the LRU now */
+	return read_swap_cache_async(entry, gfp_mask, vma, addr);
+}