Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753780AbaAUIRQ (ORCPT ); Tue, 21 Jan 2014 03:17:16 -0500 Received: from LGEMRELSE1Q.lge.com ([156.147.1.111]:58153 "EHLO LGEMRELSE1Q.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751048AbaAUIRO (ORCPT ); Tue, 21 Jan 2014 03:17:14 -0500 X-AuditID: 9c93016f-b7b1aae000000f15-97-52de2d053484 Date: Tue, 21 Jan 2014 17:18:20 +0900 From: Minchan Kim To: Cai Liu Cc: Seth Jennings , Bob Liu , Cai Liu , Andrew Morton , Linux-MM , Linux-Kernel Subject: Re: [PATCH v2] mm/zswap: Check all pool pages instead of one pool pages Message-ID: <20140121081820.GA31230@bbox> References: <20140121050439.GA16664@bbox> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: AAAAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, On Tue, Jan 21, 2014 at 02:35:07PM +0800, Cai Liu wrote: > 2014/1/21 Minchan Kim : > > Please check your MUA and don't break thread. > > > > On Tue, Jan 21, 2014 at 11:07:42AM +0800, Cai Liu wrote: > >> Thanks for your review. > >> > >> 2014/1/21 Minchan Kim : > >> > Hello Cai, > >> > > >> > On Mon, Jan 20, 2014 at 03:50:18PM +0800, Cai Liu wrote: > >> >> zswap can support multiple swapfiles. So we need to check > >> >> all zbud pool pages in zswap. > >> >> > >> >> Version 2: > >> >> * add *total_zbud_pages* in zbud to record all the pages in pools > >> >> * move the updating of pool pages statistics to > >> >> alloc_zbud_page/free_zbud_page to hide the details > >> >> > >> >> Signed-off-by: Cai Liu > >> >> --- > >> >> include/linux/zbud.h | 2 +- > >> >> mm/zbud.c | 44 ++++++++++++++++++++++++++++++++------------ > >> >> mm/zswap.c | 4 ++-- > >> >> 3 files changed, 35 insertions(+), 15 deletions(-) > >> >> > >> >> diff --git a/include/linux/zbud.h b/include/linux/zbud.h > >> >> index 2571a5c..1dbc13e 100644 > >> >> --- a/include/linux/zbud.h > >> >> +++ b/include/linux/zbud.h > >> >> @@ -17,6 +17,6 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle); > >> >> int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries); > >> >> void *zbud_map(struct zbud_pool *pool, unsigned long handle); > >> >> void zbud_unmap(struct zbud_pool *pool, unsigned long handle); > >> >> -u64 zbud_get_pool_size(struct zbud_pool *pool); > >> >> +u64 zbud_get_pool_size(void); > >> >> > >> >> #endif /* _ZBUD_H_ */ > >> >> diff --git a/mm/zbud.c b/mm/zbud.c > >> >> index 9451361..711aaf4 100644 > >> >> --- a/mm/zbud.c > >> >> +++ b/mm/zbud.c > >> >> @@ -52,6 +52,13 @@ > >> >> #include > >> >> #include > >> >> > >> >> +/********************************* > >> >> +* statistics > >> >> +**********************************/ > >> >> + > >> >> +/* zbud pages in all pools */ > >> >> +static u64 total_zbud_pages; > >> >> + > >> >> /***************** > >> >> * Structures > >> >> *****************/ > >> >> @@ -142,10 +149,28 @@ static struct zbud_header *init_zbud_page(struct page *page) > >> >> return zhdr; > >> >> } > >> >> > >> >> +static struct page *alloc_zbud_page(struct zbud_pool *pool, gfp_t gfp) > >> >> +{ > >> >> + struct page *page; > >> >> + > >> >> + page = alloc_page(gfp); > >> >> + > >> >> + if (page) { > >> >> + pool->pages_nr++; > >> >> + total_zbud_pages++; > >> > > >> > Who protect race? > >> > >> Yes, here the pool->pages_nr and also the total_zbud_pages are not protected. > >> I will re-do it. > >> > >> I will change *total_zbud_pages* to atomic type. > > > > Wait, it doesn't make sense. Now, you assume zbud allocator would be used > > for only zswap. It's true until now but we couldn't make sure it in future. > > If other user start to use zbud allocator, total_zbud_pages would be pointless. > > Yes, you are right. ZBUD is a common module. So in this patch calculate the > zswap pool size in zbud is not suitable. > > > > > Another concern is that what's your scenario for above two swap? > > How often we need to call zbud_get_pool_size? > > In previous your patch, you reduced the number of call so IIRC, > > we only called it in zswap_is_full and for debugfs. > > zbud_get_pool_size() is called frequently when adding/freeing zswap > entry happen in zswap . This is why in this patch I added a counter in zbud, > and then in zswap the iteration of zswap_list to calculate the pool size will > not be needed. We can remove updating zswap_pool_pages in zswap_frontswap_store and zswap_free_entry as I said. So zswap_is_full is only hot spot. Do you think it's still big overhead? Why? Maybe locking to prevent destroying? Then, we can use RCU to minimize the overhead as I mentioned. > > > Of course, it would need some lock or refcount to prevent destroy > > of zswap_tree in parallel with zswap_frontswap_invalidate_area but > > zswap_is_full doesn't need to be exact so RCU would be good fit. > > > > Most important point is that now zswap doesn't consider multiple swap. > > For example, Let's assume you uses two swap A and B with different priority > > and A already has charged 19% long time ago and let's assume that A swap is > > full now so VM start to use B so that B has charged 1% recently. > > It menas zswap charged (19% + 1%)i is full by default. > > > > Then, if VM want to swap out more pages into B, zbud_reclaim_page > > would be evict one of pages in B's pool and it would be repeated > > continuously. It's totally LRU reverse problem and swap thrashing in B > > would happen. > > > > The scenario is below: > There are 2 swap A, B in system. If pool size of A reach 19% of ram > size and swap A > is also full. Then swap B will be used. Pool size of B will be > increased until it hit > the 20% of the ram size. By now zswap pool size is about 39% of ram size. > If there are more than 2 swap file/device, zswap pool will expand out > of control > and there may be no swapout happened. I know. > > I think the original intention of zswap designer is to keep the total > zswap pools size below > 20% of RAM size. My point is your patch still doesn't solve the example I mentioned. > > Thanks. > > > Please say your usecase scenario and if it's really problem, > > we need more surgery. > > > > Thanks. > > > >> For *pool->pages_nr*, one way is to use pool->lock to protect. But I > >> think it is too heavy. > >> So does it ok to change pages_nr to atomic type too? > >> > >> > >> > > >> >> + } > >> >> + > >> >> + return page; > >> >> +} > >> >> + > >> >> + > >> >> /* Resets the struct page fields and frees the page */ > >> >> -static void free_zbud_page(struct zbud_header *zhdr) > >> >> +static void free_zbud_page(struct zbud_pool *pool, struct zbud_header *zhdr) > >> >> { > >> >> __free_page(virt_to_page(zhdr)); > >> >> + > >> >> + pool->pages_nr--; > >> >> + total_zbud_pages--; > >> >> } > >> >> > >> >> /* > >> >> @@ -279,11 +304,10 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp, > >> >> > >> >> /* Couldn't find unbuddied zbud page, create new one */ > >> >> spin_unlock(&pool->lock); > >> >> - page = alloc_page(gfp); > >> >> + page = alloc_zbud_page(pool, gfp); > >> >> if (!page) > >> >> return -ENOMEM; > >> >> spin_lock(&pool->lock); > >> >> - pool->pages_nr++; > >> >> zhdr = init_zbud_page(page); > >> >> bud = FIRST; > >> >> > >> >> @@ -349,8 +373,7 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle) > >> >> if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { > >> >> /* zbud page is empty, free */ > >> >> list_del(&zhdr->lru); > >> >> - free_zbud_page(zhdr); > >> >> - pool->pages_nr--; > >> >> + free_zbud_page(pool, zhdr); > >> >> } else { > >> >> /* Add to unbuddied list */ > >> >> freechunks = num_free_chunks(zhdr); > >> >> @@ -447,8 +470,7 @@ next: > >> >> * Both buddies are now free, free the zbud page and > >> >> * return success. > >> >> */ > >> >> - free_zbud_page(zhdr); > >> >> - pool->pages_nr--; > >> >> + free_zbud_page(pool, zhdr); > >> >> spin_unlock(&pool->lock); > >> >> return 0; > >> >> } else if (zhdr->first_chunks == 0 || > >> >> @@ -496,14 +518,12 @@ void zbud_unmap(struct zbud_pool *pool, unsigned long handle) > >> >> > >> >> /** > >> >> * zbud_get_pool_size() - gets the zbud pool size in pages > >> >> - * @pool: pool whose size is being queried > >> >> * > >> >> - * Returns: size in pages of the given pool. The pool lock need not be > >> >> - * taken to access pages_nr. > >> >> + * Returns: size in pages of all the zbud pools. > >> >> */ > >> >> -u64 zbud_get_pool_size(struct zbud_pool *pool) > >> >> +u64 zbud_get_pool_size(void) > >> >> { > >> >> - return pool->pages_nr; > >> >> + return total_zbud_pages; > >> >> } > >> >> > >> >> static int __init init_zbud(void) > >> >> diff --git a/mm/zswap.c b/mm/zswap.c > >> >> index 5a63f78..ef44d9d 100644 > >> >> --- a/mm/zswap.c > >> >> +++ b/mm/zswap.c > >> >> @@ -291,7 +291,7 @@ static void zswap_free_entry(struct zswap_tree *tree, > >> >> zbud_free(tree->pool, entry->handle); > >> >> zswap_entry_cache_free(entry); > >> >> atomic_dec(&zswap_stored_pages); > >> >> - zswap_pool_pages = zbud_get_pool_size(tree->pool); > >> >> + zswap_pool_pages = zbud_get_pool_size(); > >> >> } > >> >> > >> >> /* caller must hold the tree lock */ > >> >> @@ -716,7 +716,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, > >> >> > >> >> /* update stats */ > >> >> atomic_inc(&zswap_stored_pages); > >> >> - zswap_pool_pages = zbud_get_pool_size(tree->pool); > >> >> + zswap_pool_pages = zbud_get_pool_size(); > >> >> > >> >> return 0; > >> >> > >> >> -- > >> >> 1.7.10.4 > >> >> > >> >> -- > >> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in > >> >> the body to majordomo@kvack.org. For more info on Linux MM, > >> >> see: http://www.linux-mm.org/ . > >> >> Don't email: email@kvack.org > >> > > >> > -- > >> > Kind regards, > >> > Minchan Kim > >> > >> -- > >> To unsubscribe, send a message with 'unsubscribe linux-mm' in > >> the body to majordomo@kvack.org. For more info on Linux MM, > >> see: http://www.linux-mm.org/ . > >> Don't email: email@kvack.org > > > > -- > > Kind regards, > > Minchan Kim > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- Kind regards, Minchan Kim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/