From: Michal Hocko
To: Roman Gushchin
Cc: Sasha Levin, Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Kernel Team, Rik van Riel, Randy Dunlap, Sasha Levin
Subject: Re: [RFC PATCH] mm: don't reclaim inodes with many attached pages
Date: Fri, 26 Oct 2018 09:33:03 +0200
Message-ID: <20181026073303.GW18839@dhcp22.suse.cz>
In-Reply-To: <20181025203240.GA2504@tower.DHCP.thefacebook.com>
References: <20181023164302.20436-1-guro@fb.com>
 <20181024151950.36fe2c41957d807756f587ca@linux-foundation.org>
 <20181025092352.GP18839@dhcp22.suse.cz>
 <20181025124442.5513d282273786369bbb7460@linux-foundation.org>
 <20181025202014.GA216405@sasha-vm>
 <20181025203240.GA2504@tower.DHCP.thefacebook.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
On Thu 25-10-18 20:32:47, Roman Gushchin wrote:
> On Thu, Oct 25, 2018 at 04:20:14PM -0400, Sasha Levin wrote:
> > On Thu, Oct 25, 2018 at 12:44:42PM -0700, Andrew Morton wrote:
> > > On Thu, 25 Oct 2018 11:23:52 +0200 Michal Hocko wrote:
> > > > On Wed 24-10-18 15:19:50, Andrew Morton wrote:
> > > > > On Tue, 23 Oct 2018 16:43:29 +0000 Roman Gushchin wrote:
> > > > > > Spock reported that commit 172b06c32b94 ("mm: slowly shrink slabs
> > > > > > with a relatively small number of objects") leads to a regression
> > > > > > on his setup: periodically the majority of the pagecache is evicted
> > > > > > without an obvious reason, while before the change the amount of
> > > > > > free memory was balancing around the watermark.
> > > > > >
> > > > > > The reason is that the change mentioned above created some minimal
> > > > > > background pressure on the inode cache. The problem is that once an
> > > > > > inode is selected for reclaim, all attached pagecache pages are
> > > > > > stripped, no matter how many of them there are. So if a huge
> > > > > > multi-gigabyte file is cached in memory, and the goal is to reclaim
> > > > > > only a few slab objects (unused inodes), we can still end up
> > > > > > evicting all those gigabytes of pagecache at once.
> > > > > >
> > > > > > The workload described by Spock has a few large non-mapped files in
> > > > > > the pagecache, so the effect is especially noticeable.
> > > > > >
> > > > > > To solve the problem, let's postpone the reclaim of inodes which
> > > > > > have more than one attached page. Let's wait until the pagecache
> > > > > > pages are evicted naturally by scanning the corresponding LRU
> > > > > > lists, and only then reclaim the inode structure.
> > > > >
> > > > > Is this regression serious enough to warrant fixing 4.19.1?
> > > >
> > > > Let's not forget about stable tree(s) which backported 172b06c32b94. I
> > > > would suggest reverting there.
> > >
> > > Yup. Sasha, can you please take care of this?
> >
> > Sure, I'll revert it from current stable trees.
> >
> > Should 172b06c32b94 and this commit be backported once Roman confirms
> > the issue is fixed? As far as I understand, 172b06c32b94 addressed an
> > issue FB were seeing in their fleet and needed to be fixed.
>
> The memcg leak was also independently reported by several companies,
> so it's not only about our fleet.

By memcg leak you mean a lot of dead memcgs with a small amount of memory
staying behind, which the global memory pressure removes only very slowly
or almost not at all, right? I have a vague recollection that systemd can
trigger a pattern which makes this "leak" noticeable. Is that right?

If yes, what would be a minimal and safe fix for the stable tree?
"mm: don't miss the last page because of round-off error" sounds like the
candidate, but I never got around to reviewing it properly.

> The memcg css leak is fixed by a series of commits (as in the mm tree):
>
> 37e521912118 math64: prevent double calculation of DIV64_U64_ROUND_UP() arguments
> c6be4e82b1b3 mm: don't miss the last page because of round-off error
> f2e821fc8c63 mm: drain memcg stocks on css offlining
> 03a971b56f18 mm: rework memcg kernel stack accounting

Btw. none of these SHAs refer to anything in my git tree. They all seem
to be in the -next tree though.

> 172b06c32b94 mm: slowly shrink slabs with a relatively small number of objects
>
> The last one by itself isn't enough, and it makes no sense to backport it
> without all the other patches. So I'd either backport them all (including
> 47036ad4032e ("mm: don't reclaim inodes with many attached pages")), or
> just revert 172b06c32b94.
>
> Also, 172b06c32b94 ("mm: slowly shrink slabs with a relatively small
> number of objects") by itself is fine, but it reveals an independent issue
> in the inode reclaim code, which 47036ad4032e ("mm: don't reclaim inodes
> with many attached pages") aims to fix.

To me it sounds like it needs much more time to settle before it can be
considered safe for the stable tree. Even if the patch itself is correct,
it seems too subtle: it reveals a behavior which was not anticipated, and
that just proves it is far from straightforward.

-- 
Michal Hocko
SUSE Labs