Date: Wed, 9 Oct 2013 10:30:22 -0500
From: Seth Jennings
To: Krzysztof Kozlowski
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton, Bob Liu, Mel Gorman, Bartlomiej Zolnierkiewicz, Marek Szyprowski, Tomasz Stanislawski, Kyungmin Park, Dave Hansen, Minchan Kim
Subject: Re: [PATCH v3 5/6] zswap: replace tree in zswap with radix tree in zbud
Message-ID: <20131009153022.GB5406@variantweb.net>
In-Reply-To: <1381238980-2491-6-git-send-email-k.kozlowski@samsung.com>

On Tue, Oct 08, 2013 at 03:29:39PM +0200, Krzysztof Kozlowski wrote:
> This allows migration of zbud pages.
>
> Add a radix tree to zbud, replacing the red-black tree in zswap. Use
> the offset as the index into this tree, so effectively the handle
> concept is not needed anymore. Zswap uses only the offset to access
> data stored in zbud.
>
> The functionality of zswap's red-black tree was merged into zbud's
> radix tree.
>
> The patch changes the way duplicated pages are stored. Now zswap
> refuses to store them.
>
> This change also exposes and fixes a race condition between:
> - zbud_reclaim_page() (called from zswap_frontswap_store())
> and
> - zbud_free() (called from zswap_frontswap_invalidate_page()).
> This race was present already, but the additional locking and indirect
> use of the handle make it frequent under high memory pressure.
>
> The race typically looks like:
> - thread 1: zbud_reclaim_page()
> - thread 1: zswap_writeback_entry()
>   - zbud_map()
> - thread 0: zswap_frontswap_invalidate_page()
>   - zbud_free()
> - thread 1: reads the zswap_entry from memory or calls zbud_unmap(),
>   now on an invalid memory address
>
> zbud_reclaim_page() calls the evict handler (zswap_writeback_entry())
> without holding the pool lock. zswap_writeback_entry() reads memory at
> the address obtained from zbud_map() without any lock held. If an
> invalidate happens during this time, zbud_free() will remove the
> handle from the tree and zbud_unmap() won't succeed.
>
> The new map_count fields in zbud_header try to address this problem by
> protecting handles from being freed.
>
> There are still some things to do in this patch:
> 1. Accept storing of duplicated pages (as in the original zswap).
> 2. Use RCU for radix tree reads and updates.
> 3. Optimize locking in zbud_free_all().
> 4. Iterate over the LRU list instead of the radix tree in
>    zbud_free_all().

I started working on this in parallel to see if we would come up with
the same solutions. In many places we did :)

A few places where we did things differently:

I see you changed the first|last size in zbud from a size in chunks to
a size in bytes and then proceed to do size_to_chunks() in many places.
I think we should keep the size in chunks and track the exact size of
the entry at the zswap level, in a struct zswap_header placed before
the compressed page data inside the zbud allocation.
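Roughly what I have in mind (just a sketch; the field name and layout
here are illustrative, not code from either of our patches):

	/*
	 * Prepended by zswap to its data inside each zbud
	 * allocation; zbud itself never looks at it.
	 */
	struct zswap_header {
		unsigned int length;	/* exact compressed size in bytes */
	};

	/*
	 * Layout within a zbud buddy:
	 *   [ struct zswap_header | compressed page data ]
	 */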
Doing that also does away with the struct zbud_mapped_entry argument to
the zbud_map() call, which is used to return both the address and the
length. Now we can just return the address, and the zswap layer can
determine the length from its own zswap_header inside the mapped data.

In my approach I was also looking at allowing zbud pools to use HIGHMEM
pages, since the handle is no longer an address. This requires the
pages being mapped to be kmapped (atomically), which disables
preemption. This isn't additional overhead, since the map/unmap
corresponds to a compress/decompress operation at the zswap level,
which uses per-cpu variables that disable preemption already.

With preemption disabled during the map, a per-cpu variable can store
the current zbud mapping at the zbud layer. This would eliminate the
lookup by offset in zbud_unmap(), possibly do away with the race
condition you mention, and remove the need for the per-buddy map
counts, which are kind of messy. (See the sketch at the end of this
mail.)

Once we get this sorted, the next step in my mind is to teach the MM
about zbud pages in the reclaim path so we can do away with the LRU
logic in zbud.

Thanks again for all your work!

Seth
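P.S. To make the per-cpu mapping idea concrete, here is a rough sketch
of what zbud_map()/zbud_unmap() could look like. The pool fields shown
(lock, page_tree) and the tree storing struct page pointers keyed by
offset are my assumptions; the first|last buddy offset within the page
is elided, and the locking is not complete. A sketch of the idea, not a
working patch:

	#include <linux/highmem.h>	/* kmap_atomic()/kunmap_atomic() */
	#include <linux/percpu.h>
	#include <linux/radix-tree.h>
	#include <linux/spinlock.h>

	struct zbud_pool {
		spinlock_t lock;
		struct radix_tree_root page_tree;	/* offset -> struct page * */
		/* ... */
	};

	/*
	 * The mapping this CPU currently holds; valid only while
	 * preemption is disabled by kmap_atomic() below.
	 */
	static DEFINE_PER_CPU(void *, zbud_mapped_addr);

	void *zbud_map(struct zbud_pool *pool, pgoff_t offset)
	{
		struct page *page;
		void *addr;

		spin_lock(&pool->lock);
		page = radix_tree_lookup(&pool->page_tree, offset);
		spin_unlock(&pool->lock);
		if (!page)
			return NULL;

		/*
		 * kmap_atomic() disables preemption, so the per-cpu
		 * slot stays valid until the matching zbud_unmap()
		 * on this CPU.
		 */
		addr = kmap_atomic(page);
		this_cpu_write(zbud_mapped_addr, addr);

		/* zswap finds the exact length in its own zswap_header */
		return addr;
	}

	void zbud_unmap(struct zbud_pool *pool, pgoff_t offset)
	{
		/* No lookup by offset: unmap whatever this CPU mapped. */
		kunmap_atomic(this_cpu_read(zbud_mapped_addr));
	}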