Date: Wed, 27 Mar 2013 14:41:07 -0700 (PDT)
From: Hugh Dickins
To: Minchan Kim
cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Dan Magenheimer, Seth Jennings, Nitin Gupta, Konrad Rzeszutek Wilk,
    Shaohua Li, Kamezawa Hiroyuki
Subject: Re: [RFC] mm: remove swapcache page early
In-Reply-To: <1364350932-12853-1-git-send-email-minchan@kernel.org>
References: <1364350932-12853-1-git-send-email-minchan@kernel.org>

On Wed, 27 Mar 2013, Minchan Kim wrote:

> Swap subsystem does lazy swap slot free with expecting the page
> would be swapped out again so we can't avoid unnecessary write.

                             so we can avoid unnecessary write.
>
> But the problem in in-memory swap is that it consumes memory space
> until vm_swap_full(ie, used half of all of swap device) condition
> meet. It could be bad if we use multiple swap device, small in-memory swap
> and big storage swap or in-memory swap alone.

That is a very good realization: it's surprising that none of us
thought of it before - no disrespect to you, well done, thank you.

And I guess swap readahead is utterly unhelpful in this case too.

>
> This patch changes vm_swap_full logic slightly so it could free
> swap slot early if the backed device is really fast.
> For it, I used SWP_SOLIDSTATE but It might be controversial.

But I strongly disagree with almost everything in your patch :)
I disagree with addressing it in vm_swap_full(), I disagree that
it can be addressed by device, I disagree that it has anything to
do with SWP_SOLIDSTATE.

This is not a problem with swapping to /dev/ram0 or to /dev/zram0,
is it?  In those cases, a fixed amount of memory has been set aside
for swap, and it works out just like with disk block devices.  The
memory set aside may be wasted, but that is accepted upfront.

Similarly, this is not a problem with swapping to SSD.  There might
or might not be other reasons for adjusting the vm_swap_full() logic
for SSD or generally, but those have nothing to do with this issue.

The problem here is peculiar to frontswap, and the variably sized
memory behind it, isn't it?  We are accustomed to using swap to free
up memory by transferring its data to some other, cheaper but slower
resource.

But in the case of frontswap and zmem (I'll say that to avoid thinking
through which backends are actually involved), it is not a cheaper and
slower resource, but the very same memory we are trying to save: swap
is stolen from the memory under reclaim, so any duplication becomes
counter-productive (if we ignore cpu compression/decompression costs:
I have no idea how fair it is to do so, but anyone who chooses zmem
is prepared to pay some cpu price for that).

And because it's a frontswap thing, we cannot decide this by device:
frontswap may or may not stand in front of each device.  There is no
problem with swapcache duplicated on disk (until that area approaches
being full or fragmented), but at the higher level we cannot see what
is in zmem and what is on disk: we only want to free up the zmem dup.

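To put rough numbers on the duplication described above, here is a
minimal userspace sketch, not kernel code.  The comparison in
toy_vm_swap_full() is the one in the vm_swap_full() quoted in the patch;
the struct, the toy_* names and the average compressed size per page are
assumptions made only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: every name below is hypothetical. */
struct toy_swap_state {
	long total_swap_pages;	/* size of the swap area, in pages */
	long nr_swap_pages;	/* swap slots still free, in pages */
};

/*
 * Same comparison as the vm_swap_full() quoted in the patch:
 * true only once fewer than half of the swap slots remain free.
 */
static bool toy_vm_swap_full(const struct toy_swap_state *s)
{
	return s->nr_swap_pages * 2 < s->total_swap_pages;
}

/*
 * Bytes of RAM still held by compressed copies of pages that have
 * already been swapped back in.  compressed_size is an assumed
 * average compressed size per page, not a measured value.
 */
static size_t toy_dup_cost(size_t dup_pages, size_t compressed_size)
{
	return dup_pages * compressed_size;
}
```

With 1000 slots of which 600 are free, toy_vm_swap_full() is still
false, so under this model the compressed copies of the 400 swapped-in
pages are not released early, and toy_dup_cost() gives the memory they
keep occupying.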
I believe the answer is for frontswap/zmem to invalidate the frontswap
copy of the page (to free up the compressed memory when possible) and
SetPageDirty on the PageUptodate PageSwapCache page when swapping in
(setting page dirty so nothing will later go to read it from the
unfreed location on backing swap disk, which was never written).

We cannot rely on freeing the swap itself, because in general there
may be multiple references to the swap, and we only satisfy the one
which has faulted.  It may or may not be a good idea to use rmap to
locate the other places to insert pte in place of swap entry, to
resolve them all at once; but we have chosen not to do so in the past,
and there's no need for that, if the zmem gets invalidated and the
swapcache page set dirty.

Hugh

> So let's add Ccing Shaohua and Hugh.
> If it's a problem for SSD, I'd like to create new type SWP_INMEMORY
> or something for z* family.
>
> Other problem is zram is block device so that it can set SWP_INMEMORY
> or SWP_SOLIDSTATE easily(ie, actually, zram is already done) but
> I have no idea to use it for frontswap.
>
> Any idea?
>
> Other optimize point is we remove it unconditionally when we
> found it's exclusive when swap in happen.
> It could help frontswap family, too.
> What do you think about it?
>
> Cc: Hugh Dickins
> Cc: Dan Magenheimer
> Cc: Seth Jennings
> Cc: Nitin Gupta
> Cc: Konrad Rzeszutek Wilk
> Cc: Shaohua Li
> Signed-off-by: Minchan Kim
> ---
>  include/linux/swap.h | 11 ++++++++---
>  mm/memory.c          |  3 ++-
>  mm/swapfile.c        | 11 +++++++----
>  mm/vmscan.c          |  2 +-
>  4 files changed, 18 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2818a12..1f4df66 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -359,9 +359,14 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
>  extern atomic_long_t nr_swap_pages;
>  extern long total_swap_pages;
>  
> -/* Swap 50% full? Release swapcache more aggressively.. */
> -static inline bool vm_swap_full(void)
> +/*
> + * Swap 50% full or fast backed device?
> + * Release swapcache more aggressively.
> + */
> +static inline bool vm_swap_full(struct swap_info_struct *si)
>  {
> +	if (si->flags & SWP_SOLIDSTATE)
> +		return true;
>  	return atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages;
>  }
>  
> @@ -405,7 +410,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>  #define get_nr_swap_pages()			0L
>  #define total_swap_pages			0L
>  #define total_swapcache_pages()			0UL
> -#define vm_swap_full()				0
> +#define vm_swap_full(si)			0
>  
>  #define si_swapinfo(val) \
>  	do { (val)->freeswap = (val)->totalswap = 0; } while (0)
> diff --git a/mm/memory.c b/mm/memory.c
> index 705473a..1ca21a9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3084,7 +3084,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	mem_cgroup_commit_charge_swapin(page, ptr);
>  
>  	swap_free(entry);
> -	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> +	if (likely(PageSwapCache(page)) && (vm_swap_full(page_swap_info(page))
> +		|| (vma->vm_flags & VM_LOCKED) || PageMlocked(page)))
>  		try_to_free_swap(page);
>  	unlock_page(page);
>  	if (page != swapcache) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1bee6fa..f9cc701 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -293,7 +293,7 @@ checks:
>  		scan_base = offset = si->lowest_bit;
>  
>  	/* reuse swap entry of cache-only swap if not busy. */
> -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +	if (vm_swap_full(si) && si->swap_map[offset] == SWAP_HAS_CACHE) {
>  		int swap_was_freed;
>  		spin_unlock(&si->lock);
>  		swap_was_freed = __try_to_reclaim_swap(si, offset);
> @@ -382,7 +382,8 @@ scan:
>  			spin_lock(&si->lock);
>  			goto checks;
>  		}
> -		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +		if (vm_swap_full(si) &&
> +			si->swap_map[offset] == SWAP_HAS_CACHE) {
>  			spin_lock(&si->lock);
>  			goto checks;
>  		}
> @@ -397,7 +398,8 @@ scan:
>  			spin_lock(&si->lock);
>  			goto checks;
>  		}
> -		if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> +		if (vm_swap_full(si) &&
> +			si->swap_map[offset] == SWAP_HAS_CACHE) {
>  			spin_lock(&si->lock);
>  			goto checks;
>  		}
> @@ -763,7 +765,8 @@ int free_swap_and_cache(swp_entry_t entry)
>  		 * Also recheck PageSwapCache now page is locked (above).
>  		 */
>  		if (PageSwapCache(page) && !PageWriteback(page) &&
> -				(!page_mapped(page) || vm_swap_full())) {
> +				(!page_mapped(page) ||
> +				vm_swap_full(page_swap_info(page)))) {
>  			delete_from_swap_cache(page);
>  			SetPageDirty(page);
>  		}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index df78d17..145c59c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -933,7 +933,7 @@ cull_mlocked:
>  
>  activate_locked:
>  		/* Not a candidate for swapping, so reclaim swap space. */
> -		if (PageSwapCache(page) && vm_swap_full())
> +		if (PageSwapCache(page) && vm_swap_full(page_swap_info(page)))
>  			try_to_free_swap(page);
>  		VM_BUG_ON(PageActive(page));
>  		SetPageActive(page);
> -- 
> 1.8.2
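
For comparison with the patch, the alternative suggested in the reply
(invalidate the zmem copy at swapin and mark the swapcache page dirty)
can be modelled in userspace roughly as follows.  This is only a sketch
under stated assumptions: the structs, their fields and toy_swapin() are
hypothetical names for illustration, not kernel or frontswap interfaces.

```c
#include <stdbool.h>

/* Toy userspace model; every name here is hypothetical. */
struct toy_page {
	bool uptodate;		/* contents are valid in RAM */
	bool in_swapcache;	/* page is still tied to a swap slot */
	bool dirty;		/* must be written out before it can be dropped */
};

struct toy_slot {
	int refs;		/* swap references still pointing at the slot */
	bool in_zmem;		/* compressed copy held by the frontswap backend */
};

/*
 * Swap-in for one faulting reference.  The slot stays allocated while
 * other references remain, but the compressed copy is dropped, and the
 * page is marked dirty so that nothing later tries to read it back from
 * a location on the backing device that was never written.
 */
static void toy_swapin(struct toy_page *page, struct toy_slot *slot)
{
	page->uptodate = true;
	page->in_swapcache = true;
	slot->refs--;			/* only the faulting reference is resolved */

	if (slot->in_zmem) {
		slot->in_zmem = false;	/* invalidate: frees the compressed memory */
		page->dirty = true;
	}
}
```

Note that the model never looks at the device or at how full swap is:
the only thing released early is the compressed duplicate.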