Date: Mon, 14 Jun 2021 14:59:12 -0700
From: Andrew Morton
To: Johannes Weiner
Cc: Roman Gushchin, Tejun Heo, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com, Dave Chinner
Subject: Re: [PATCH 4/4] vfs: keep inodes with page cache off the inode shrinker LRU
Message-Id: <20210614145912.feb751df928f38476048ec15@linux-foundation.org>
In-Reply-To: <20210614211904.14420-4-hannes@cmpxchg.org>
References: <20210614211904.14420-1-hannes@cmpxchg.org> <20210614211904.14420-4-hannes@cmpxchg.org>

On Mon, 14 Jun 2021 17:19:04 -0400 Johannes Weiner wrote:

> Historically (pre-2.5), the inode shrinker used to reclaim only empty
> inodes and skip over those that still contained page cache. This
> caused problems on highmem hosts: struct inode could fill lowmem
> zones before the cache was getting reclaimed in the highmem zones.
>
> To address this, the inode shrinker started to strip page cache to
> facilitate reclaiming lowmem. However, this comes with its own set of
> problems: the shrinkers may drop actively used page cache just because
> the inodes are not currently open or dirty - think working with a
> large git tree. It further doesn't respect cgroup memory protection
> settings and can cause priority inversions between containers.
>
> Nowadays, the page cache also holds non-resident info for evicted
> cache pages in order to detect refaults. We've come to rely heavily on
> this data inside reclaim for protecting the cache workingset and
> driving swap behavior. We also use it to quantify and report workload
> health through psi. The latter in turn is used for fleet health
> monitoring, as well as driving automated memory sizing of workloads
> and containers, proactive reclaim and memory offloading schemes.
>
> The consequence of dropping page cache prematurely is that we're
> seeing subtle and not-so-subtle failures in all of the above-mentioned
> scenarios, with the workload generally entering unexpected thrashing
> states while losing the ability to reliably detect it.
>
> To fix this, at least on non-highmem systems, going back to rotating
> inodes on the LRU isn't feasible. We've tried (commit a76cf1a474d7
> ("mm: don't reclaim inodes with many attached pages")) and failed
> (commit 69056ee6a8a3 ("Revert "mm: don't reclaim inodes with many
> attached pages"")). The issue is mostly that shrinker pools attract
> pressure based on their size, and when objects get skipped the
> shrinkers remember this as deferred reclaim work. This accumulates
> excessive pressure on the remaining inodes, and we can quickly eat
> into heavily used ones, or dirty ones that require IO to reclaim, when
> there potentially is plenty of cold, clean cache around still.
>
> Instead, this patch keeps populated inodes off the inode LRU in the
> first place - just like an open file or dirty state would. An
> otherwise clean and unused inode then gets queued when the last cache
> entry disappears. This solves the problem without reintroducing the
> reclaim issues, and generally is a bit more scalable than having to
> wade through potentially hundreds of thousands of busy inodes.
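
For illustration, the lifecycle described in the paragraph above boils
down to something like the toy model below. The types and helper names
are invented for the example - this is not the kernel's struct inode or
fs/inode.c code - and the only point is the ordering: an inode becomes
visible to the shrinker when, and only when, it is unreferenced, clean,
and its page cache has gone empty.

	/*
	 * Toy userspace model of "populated inodes stay off the LRU".
	 * All names here are made up for the example.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct toy_inode {
		int	refcount;	/* open files, dentries, ...	*/
		bool	dirty;		/* needs writeback		*/
		long	nr_pages;	/* resident page cache		*/
		bool	on_lru;		/* visible to the shrinker	*/
	};

	/* Would this inode be eligible for the shrinker LRU right now? */
	static bool toy_inode_reclaimable(const struct toy_inode *inode)
	{
		return inode->refcount == 0 && !inode->dirty &&
		       inode->nr_pages == 0;
	}

	/* Modelled page cache deletion path. */
	static void toy_delete_cache_page(struct toy_inode *inode)
	{
		inode->nr_pages--;
		/* Last cache entry gone: the inode may be queued now. */
		if (toy_inode_reclaimable(inode) && !inode->on_lru) {
			inode->on_lru = true;
			printf("inode queued on shrinker LRU\n");
		}
	}

	/* Modelled page cache addition: populated inodes stay off the LRU. */
	static void toy_add_cache_page(struct toy_inode *inode)
	{
		inode->nr_pages++;
		inode->on_lru = false;
	}

	int main(void)
	{
		struct toy_inode inode = { .refcount = 0, .dirty = false };

		toy_add_cache_page(&inode);	/* populated: off the LRU */
		toy_delete_cache_page(&inode);	/* last page gone: queued */
		return 0;
	}

In the actual patch the equivalent decision is of course made on struct
inode, under the locks discussed in the next quoted paragraph, with the
queueing done by the existing inode LRU machinery.
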
> Locking is a bit tricky because the locks protecting the inode state
> (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
> irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are
> serialized through i_lock, taken before the i_pages lock, to make sure
> depopulated inodes are queued reliably. Additions may race with
> deletions, but we'll check again in the shrinker. If additions race
> with the shrinker itself, we're protected by the i_lock: if
> find_inode() or iput() win, the shrinker will bail on the elevated
> i_count or I_REFERENCED; if the shrinker wins and goes ahead with the
> inode, it will set I_FREEING and inhibit further igets(), which will
> cause the other side to create a new instance of the inode instead.

And what hitherto unexpected problems will this one cause, sigh. How
exhaustively has this approach been tested?
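
For readers following the locking discussion in the quoted paragraph:
the shrinker-side race handling reduces to roughly the shape below.
Again a standalone toy model with invented names, not the actual
fs/inode.c code; the real check runs on struct inode under i_lock.

	#include <stdbool.h>

	enum { TOY_I_REFERENCED = 1, TOY_I_FREEING = 2 };

	struct toy_inode {
		int		i_count;	/* active references */
		unsigned int	i_state;	/* TOY_I_* flags     */
	};

	/* Returns true if the shrinker may go ahead and free the inode. */
	static bool toy_shrinker_claim(struct toy_inode *inode)
	{
		/* The per-inode lock would be held across this check. */
		if (inode->i_count > 0 ||
		    (inode->i_state & TOY_I_REFERENCED)) {
			/* Lookup or iput() got here first: back off. */
			inode->i_state &= ~TOY_I_REFERENCED;
			return false;
		}
		/* Shrinker wins: block lookups from reusing this inode. */
		inode->i_state |= TOY_I_FREEING;
		return true;
	}

	int main(void)
	{
		struct toy_inode inode = {
			.i_count = 0,
			.i_state = TOY_I_REFERENCED,
		};

		/* First pass only clears the referenced bit and bails... */
		if (toy_shrinker_claim(&inode))
			return 1;
		/* ...a later pass finds the inode cold and claims it. */
		return toy_shrinker_claim(&inode) ? 0 : 1;
	}
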