Date: Fri, 13 Jul 2018 10:36:14 +1000
From: Dave Chinner <david@fromorbit.com>
To: James Bottomley
Cc: Linus Torvalds, Matthew Wilcox, Waiman Long, Michal Hocko,
	Al Viro, Jonathan Corbet, "Luis R. Rodriguez", Kees Cook,
	Linux Kernel Mailing List, linux-fsdevel, linux-mm,
	"open list:DOCUMENTATION", Jan Kara, Paul McKenney,
	Andrew Morton, Ingo Molnar, Miklos Szeredi, Larry Woodman,
	"Wangkai (Kevin,C)"
Subject: Re: [PATCH v6 0/7] fs/dcache: Track & limit # of negative dentries
Message-ID: <20180713003614.GW2234@dastard>
References: <9f24c043-1fca-ee86-d609-873a7a8f7a64@redhat.com>
	<1531330947.3260.13.camel@HansenPartnership.com>
	<18c5cbfe-403b-bb2b-1d11-19d324ec6234@redhat.com>
	<1531336913.3260.18.camel@HansenPartnership.com>
	<4d49a270-23c9-529f-f544-65508b6b53cc@redhat.com>
	<1531411494.18255.6.camel@HansenPartnership.com>
	<20180712164932.GA3475@bombadil.infradead.org>
	<1531416080.18255.8.camel@HansenPartnership.com>
	<1531425435.18255.17.camel@HansenPartnership.com>
In-Reply-To: <1531425435.18255.17.camel@HansenPartnership.com>

On Thu, Jul 12, 2018 at 12:57:15PM -0700, James Bottomley wrote:
> What surprises me most about this behaviour is the steadiness of the
> page cache ... I would have thought we'd have shrunk it somewhat given
> the intense call on the dcache.

Oh, good, the page cache vs superblock shrinker balancing still
protects the working set of each cache the way it's supposed to
under heavy single cache pressure. :)

Keep in mind that the amount of work slab cache shrinkers perform is
directly proportional to the amount of page cache reclaim that is
performed and the size of the slab cache being reclaimed. IOWs,
under a "single cache pressure" workload we should be directing
reclaim work to the huge cache creating the pressure and doing very
little reclaim from other caches....

[ What follows from here is conjecture, but it is based on what I've
seen over the past 10+ years on systems with large numbers of
negative dentries and fragmented dentry/inode caches. ]

However, this only reaches steady state if the reclaim rate can keep
ahead of the allocation rate. This single-threaded micro-workload
won't result in an internally fragmented dentry slab cache, so
reclaim is going to be as efficient as possible and have the CPU to
keep up with the allocation rate. i.e. bulk negative dentry reclaim
is cheap, runs in LRU order, and frees slab pages quickly and
efficiently in large batches, so steady state is easily reached.

Problems arise when the slab *page* reclaim rate drops below the
allocation rate, i.e. when short-term (negative) dentries are mixed
into the same slab pages as long-term stable dentries. This causes
the dentry cache to fragment internally - reclaim hits the negative
dentries and creates large numbers of partial pages - and so reclaim
of negative dentries fails to free memory. Creating new negative
dentries then fills these partial pages first, so the alloc/reclaim
cycles on negative dentries only ever produce partial pages and
never free slab cache pages. IOWs, the cost of reclaiming slab
*pages* goes way up even though the cost of reclaiming individual
dentries remains the same. That's the underlying problem here: the
cost of reclaiming dentries is constant, but the cost of reclaiming
*slab pages* is not.
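[ To put rough numbers on that: below is a toy userspace model, not
kernel code. It fills simulated slab pages with a random mix of
long-term dentries and reclaimable negative dentries, "reclaims"
every negative dentry, and counts how many pages actually come free.
The objects-per-page count (~21 192-byte dentries per 4k page) and
the mix fractions are assumptions picked for illustration. ]

	#include <stdio.h>
	#include <stdlib.h>

	#define OBJS_PER_PAGE	21	/* ~192-byte dentries, 4k page */
	#define NR_PAGES	100000

	int main(void)
	{
		const double fracs[] = { 0.0, 0.01, 0.05, 0.20 };

		srand(42);
		for (int f = 0; f < 4; f++) {
			long freed = 0;

			for (long p = 0; p < NR_PAGES; p++) {
				int pinned = 0;

				/* fill one page with a random object mix */
				for (int o = 0; o < OBJS_PER_PAGE; o++)
					if ((double)rand() / RAND_MAX < fracs[f])
						pinned++;	/* long-term dentry */

				/* the page only frees if nothing pins it */
				if (pinned == 0)
					freed++;
			}
			printf("%2.0f%% long-term objects -> %5.1f%% of pages freed\n",
			       fracs[f] * 100, 100.0 * freed / NR_PAGES);
		}
		return 0;
	}

Even a small long-term fraction wrecks page reclaim: with a 5% mix
only roughly a third of the pages come back, and at 20% well under
1% do - despite every single negative dentry having been reclaimed
in all cases.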
It is not uncommon to have to trash 90% of the dentry or inode
caches to reduce internal fragmentation to the point where slab
pages start to be freed and the per-slab-page reclaim cost drops
back below the allocation cost. Then we see the system return to
normal steady state behaviour.

In situations where lots of negative dentries are created by
production workloads, that "90%" of the cache that needs to be
reclaimed to fix the internal fragmentation problem is almost all
negative dentries, plus just enough of the real dentries to free
significant quantities of partial pages in the slab. Hence negative
dentries are seen as the problem, because they make up the vast
majority of the dentries that get reclaimed when the problem goes
away.

By limiting the number of negative dentries in this case, internal
slab fragmentation is reduced such that reclaim cost never gets out
of control. While that appears to "fix" the symptoms, it doesn't
address the underlying problem. It is a partial solution at best,
and at worst it's another opaque knob that nobody knows how or when
to tune.

Very few microbenchmarks expose this internal slab fragmentation
problem because they either don't run long enough, don't create
memory pressure, or don't have access patterns that mix long and
short term slab objects together in a way that causes slab
fragmentation. Run some cold cache directory traversals (git
status?) at the same time you are creating negative dentries, so
that you create pinned partial pages in the slab cache, and see how
the behaviour changes.... (A rough sketch of such a reproducer is
appended below.)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
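[ Appended: a rough sketch of that experiment. The paths, loop bound
and walk cadence are all hypothetical, and error handling is elided.
The ftw() tree walk stands in for the cold-cache "git status"; each
stat() of a unique missing name leaves a negative dentry behind, so
short- and long-term dentries end up mixed in the same slab pages. ]

	#include <stdio.h>
	#include <sys/stat.h>
	#include <ftw.h>

	static int visit(const char *path, const struct stat *sb, int flag)
	{
		(void)path; (void)sb; (void)flag;
		return 0;	/* the walk itself populates the dcache */
	}

	int main(void)
	{
		char name[64];
		struct stat st;

		/* run long enough for memory pressure to build */
		for (long i = 0; i < (256L << 20); i++) {
			/* each failed lookup leaves a negative dentry */
			snprintf(name, sizeof(name), "/tmp/no-such-%ld", i);
			stat(name, &st);

			/*
			 * Periodically pull long-lived dentries into the
			 * same slab pages the negative dentries live in.
			 */
			if (!(i % (1 << 20)))
				ftw("/usr/share", visit, 16);
		}
		return 0;
	}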