Date: Tue, 27 Aug 2019 13:48:08 +0200
From: Michal Hocko
To: "Kirill A. Shutemov"
Shutemov" Cc: Vlastimil Babka , kirill.shutemov@linux.intel.com, Yang Shi , hannes@cmpxchg.org, rientjes@google.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable Message-ID: <20190827114808.GY7538@dhcp22.suse.cz> References: <1566410125-66011-1-git-send-email-yang.shi@linux.alibaba.com> <20190822080434.GF12785@dhcp22.suse.cz> <20190822152934.w6ztolutdix6kbvc@box> <20190826074035.GD7538@dhcp22.suse.cz> <20190826131538.64twqx3yexmhp6nf@box> <20190827060139.GM7538@dhcp22.suse.cz> <20190827110210.lpe36umisqvvesoa@box> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190827110210.lpe36umisqvvesoa@box> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 27-08-19 14:02:10, Kirill A. Shutemov wrote: > On Tue, Aug 27, 2019 at 08:01:39AM +0200, Michal Hocko wrote: > > On Mon 26-08-19 16:15:38, Kirill A. Shutemov wrote: > > > On Mon, Aug 26, 2019 at 09:40:35AM +0200, Michal Hocko wrote: > > > > On Thu 22-08-19 18:29:34, Kirill A. Shutemov wrote: > > > > > On Thu, Aug 22, 2019 at 02:56:56PM +0200, Vlastimil Babka wrote: > > > > > > On 8/22/19 10:04 AM, Michal Hocko wrote: > > > > > > > On Thu 22-08-19 01:55:25, Yang Shi wrote: > > > > > > >> Available memory is one of the most important metrics for memory > > > > > > >> pressure. > > > > > > > > > > > > > > I would disagree with this statement. It is a rough estimate that tells > > > > > > > how much memory you can allocate before going into a more expensive > > > > > > > reclaim (mostly swapping). Allocating that amount still might result in > > > > > > > direct reclaim induced stalls. I do realize that this is simple metric > > > > > > > that is attractive to use and works in many cases though. > > > > > > > > > > > > > >> Currently, the deferred split THPs are not accounted into > > > > > > >> available memory, but they are reclaimable actually, like reclaimable > > > > > > >> slabs. > > > > > > >> > > > > > > >> And, they seems very common with the common workloads when THP is > > > > > > >> enabled. A simple run with MariaDB test of mmtest with THP enabled as > > > > > > >> always shows it could generate over fifteen thousand deferred split THPs > > > > > > >> (accumulated around 30G in one hour run, 75% of 40G memory for my VM). > > > > > > >> It looks worth accounting in MemAvailable. > > > > > > > > > > > > > > OK, this makes sense. But your above numbers are really worrying. > > > > > > > Accumulating such a large amount of pages that are likely not going to > > > > > > > be used is really bad. They are essentially blocking any higher order > > > > > > > allocations and also push the system towards more memory pressure. > > > > > > > > > > > > > > IIUC deferred splitting is mostly a workaround for nasty locking issues > > > > > > > during splitting, right? This is not really an optimization to cache > > > > > > > THPs for reuse or something like that. What is the reason this is not > > > > > > > done from a worker context? At least THPs which would be freed > > > > > > > completely sound like a good candidate for kworker tear down, no? > > > > > > > > > > > > Agreed that it's a good question. For Kirill :) Maybe with kworker approach we > > > > > > also wouldn't need the cgroup awareness? 
> > > > >
> > > > > I don't remember a particular locking issue, but I cannot say there's
> > > > > none :P
> > > > >
> > > > > It's an artifact of decoupling the PMD split from the compound page
> > > > > split: the same page can be mapped multiple times with a combination
> > > > > of PMDs and PTEs. A split of one PMD doesn't need to trigger a split
> > > > > of all the PMDs and the underlying compound page.
> > > > >
> > > > > Another consideration is the fact that a page split can fail, and we
> > > > > need to have a fallback for this case.
> > > > >
> > > > > Also, in most cases a THP split would just be a waste of time if we
> > > > > did it on the spot. If you don't have memory pressure, it's better to
> > > > > wait until process termination: fewer pages on the LRU is still
> > > > > beneficial.
> > > >
> > > > This might be true, but the reality shows that a lot of THPs that are
> > > > essentially freeable on the spot might be sitting around waiting for
> > > > memory pressure. So I am not really convinced that "fewer pages on the
> > > > LRU" is a plausible justification. Can we free at least those THPs
> > > > which are completely unmapped, without any pte mappings?
> > >
> > > Completely unmapped pages will be freed with the current code. Deferred
> > > split only applies to partly mapped THPs: at least one 4k page of the
> > > THP is still mapped somewhere.
> >
> > Hmm, I am probably misreading the code, but at least the current Linus'
> > tree reads page_remove_rmap -> page_remove_anon_compound_rmap ->
> > deferred_split_huge_page even for a fully mapped THP.
>
> Well, you read correctly, but it was not intended. I screwed it up at
> some point.
>
> See the patch below. It should make it work as intended.

OK, this would indeed be much better. I was really under the impression
that the deferred splitting is required due to locking. Anyway, this
should take care of the most common usecase. If we can make the odd
cases of partially mapped THPs be handled deferred and earlier, then
maybe we do not really need the whole memcg-aware deferred shrinker and
other complications. So let's see.
--
Michal Hocko
SUSE Labs
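
The patch Kirill refers to is not quoted in this reply. As a rough sketch
of the behavior under discussion (an illustrative reconstruction, not the
actual patch from the thread; the mapcount accounting that computes 'nr'
is elided), the idea is to queue a THP for deferred split only when it is
still partly mapped:

/*
 * Illustrative sketch only, not the patch posted in the thread.
 * page_remove_anon_compound_rmap() should queue the compound page for
 * deferred split only when it remains partly mapped; a fully unmapped
 * THP is freed immediately through the normal path.
 */
static void page_remove_anon_compound_rmap(struct page *page)
{
	int nr = 0;	/* small pages that became unmapped */

	/* ... existing code clears the compound mapcount and computes
	 * 'nr' from the subpage mapcounts ... */

	if (nr) {
		/* Keep the node's anon-mapped counter in sync. */
		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);

		/*
		 * Only a partly mapped THP (fewer than HPAGE_PMD_NR
		 * subpages unmapped here) needs the deferred split queue.
		 */
		if (nr < HPAGE_PMD_NR)
			deferred_split_huge_page(page);
	}
}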
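
For context on the proposal at the top of the thread, here is a sketch of
how deferred split THPs could be folded into MemAvailable. The counter
deferred_split_nr_pages is a hypothetical placeholder, not an identifier
from the actual patch, and the rest of si_mem_available() is elided:

/* Hypothetical: small pages backing THPs on deferred split queues. */
static atomic_long_t deferred_split_nr_pages;

long si_mem_available(void)
{
	long available;

	/* ... existing estimate: free pages plus reclaimable page cache
	 * and reclaimable slab, adjusted by the low watermarks ... */

	/*
	 * Partly mapped THPs on the deferred split queues can be split
	 * under memory pressure and their unmapped subpages freed, so
	 * treat them as available, the same way reclaimable slab is.
	 */
	available += atomic_long_read(&deferred_split_nr_pages);

	if (available < 0)
		available = 0;
	return available;
}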