Date: Tue, 4 Feb 2014 20:53:52 -0500
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton
Cc: Andi Kleen, Andrea Arcangeli, Bob Liu, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Hugh Dickins, Jan Kara,
	KOSAKI Motohiro, Luigi Semenzato, Mel Gorman, Metin Doslu,
	Michel Lespinasse, Minchan Kim, Ozgun Erdogan, Peter Zijlstra,
	Rik van Riel, Roman Gushchin, Ryan Mallon, Tejun Heo,
	Vlastimil Babka, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [patch 10/10] mm: keep page cache radix tree nodes in check
Message-ID: <20140205015352.GW6963@cmpxchg.org>
References: <1391475222-1169-1-git-send-email-hannes@cmpxchg.org>
	<1391475222-1169-11-git-send-email-hannes@cmpxchg.org>
	<20140204150756.d7f46af4385026ce61c89c55@linux-foundation.org>
In-Reply-To: <20140204150756.d7f46af4385026ce61c89c55@linux-foundation.org>

On Tue, Feb 04, 2014 at 03:07:56PM -0800, Andrew Morton wrote:
> On Mon, 3 Feb 2014 19:53:42 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> >
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
    ^^^^^^^^^^^^^^^
> >
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> >
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> >
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> >
> > 3. Tree modification needs the tree lock and tree root, which are
> >    located in the address space, so store an address_space backpointer
> >    in the node.  The parent pointer of the node is in a union with the
> >    2-word rcu_head, so the backpointer comes at no extra cost as well.
> >
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
>
> changelog forgot to mention that this reclaim is performed via a
> shrinker...

Uhm... see above? :)
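To make 1.-4. concrete, the node layout works out to roughly the
following.  This is an illustrative sketch for this mail only -- the
field names, bit positions, and the 64-slot size follow the
description above, not necessarily the patch verbatim:

	#include <linux/list.h>
	#include <linux/rcupdate.h>

	struct radix_tree_node {
		/*
		 * 1. Height of the node in the low bits; its offset
		 * inside the parent is packed into the unused upper
		 * bits, so the reverse path up to the tree root can
		 * be reconstructed for deletions.
		 */
		unsigned int	path;
		/*
		 * 2. Regular entry count in the low bits; at most 64
		 * entries, so the shadow entry count fits into the
		 * unused upper bits.
		 */
		unsigned int	count;
		union {
			/* Used while the node is linked into a tree: */
			struct {
				struct radix_tree_node *parent;
				/* 3. address_space backpointer */
				void *private_data;
			};
			/* Used only while RCU-freeing the node: */
			struct rcu_head	rcu_head;
		};
		/* 4. Linkage for the per-NUMA-node shadow LRU */
		struct list_head private_list;
		void __rcu	*slots[64];
		/* tags omitted */
	};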
> How expensive is that list walk in scan_shadow_nodes()?  I assume in
> the best case it will bale out after nr_to_scan iterations?

Yes, it scans sc->nr_to_scan radix tree nodes, cleans out their
pointers, and frees them.

I ran a worst-case scenario on an 8G machine that creates one 8T
sparse file and faults one page per 64-page radix tree node, i.e. one
new node per sparse file fault, at CPU speed.  The profile:

     1   9.21%  radixblow  [kernel.kallsyms]  [k] memset
     2   7.23%  radixblow  [kernel.kallsyms]  [k] do_mpage_readpage
     3   4.76%  radixblow  [kernel.kallsyms]  [k] copy_user_generic_string
     4   3.85%  radixblow  [kernel.kallsyms]  [k] __radix_tree_lookup
     5   3.32%  kswapd0    [kernel.kallsyms]  [k] shadow_lru_isolate
     6   2.92%  radixblow  [kernel.kallsyms]  [k] get_page_from_freelist
     7   2.81%  kswapd0    [kernel.kallsyms]  [k] __delete_from_page_cache
     8   2.50%  radixblow  [kernel.kallsyms]  [k] radix_tree_node_ctor
     9   1.79%  radixblow  [kernel.kallsyms]  [k] _raw_spin_lock_irq
    10   1.70%  kswapd0    [kernel.kallsyms]  [k] __mem_cgroup_uncharge_common

Same scenario with 4 pages per 64-page radix tree node:

    13   1.39%  kswapd0    [kernel.kallsyms]  [k] shadow_lru_isolate

16 pages per 64-page node:

    75   0.20%  kswapd0    [kernel.kallsyms]  [k] shadow_lru_isolate

So I doubt this will bother anyone, especially since most use-once
streamers should have a better population density and populate the
cache at disk speed, not CPU speed.
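And for completeness, since the shrinker question came up: the
shrinker side is just a standard NUMA-aware count/scan pair on top of
a list_lru.  Schematically -- scan_shadow_nodes and shadow_lru_isolate
are the names discussed above, the rest is a placeholder sketch, not
the literal code:

	#include <linux/list_lru.h>
	#include <linux/shrinker.h>

	/* Per-NUMA-node lists of shadow-only radix tree nodes */
	static struct list_lru shadow_nodes;

	/* Unlinks one node from its tree and frees it; body omitted */
	static enum lru_status shadow_lru_isolate(struct list_head *item,
						  spinlock_t *lock, void *arg);

	static unsigned long count_shadow_nodes(struct shrinker *shrinker,
						struct shrink_control *sc)
	{
		/* How many shadow-only nodes sit on this NUMA node's list */
		return list_lru_count_node(&shadow_nodes, sc->nid);
	}

	static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
					       struct shrink_control *sc)
	{
		/*
		 * Walk at most sc->nr_to_scan nodes off the per-node
		 * list, clean out their pointers, and free them --
		 * the bounded list walk asked about above.
		 */
		return list_lru_walk_node(&shadow_nodes, sc->nid,
					  shadow_lru_isolate, NULL,
					  &sc->nr_to_scan);
	}

	static struct shrinker shadow_shrinker = {
		.count_objects = count_shadow_nodes,
		.scan_objects = scan_shadow_nodes,
		.seeks = DEFAULT_SEEKS,
		.flags = SHRINKER_NUMA_AWARE,
	};

Registered with register_shrinker(&shadow_shrinker) after
list_lru_init(&shadow_nodes), this gets invoked from the regular
reclaim path with an nr_to_scan proportional to memory pressure.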