Subject: Re: [RFC PATCH 0/3] Make deferred split shrinker memcg aware
To: David Rientjes
Cc: ktkhai@virtuozzo.com, hannes@cmpxchg.org, mhocko@suse.com,
    kirill.shutemov@linux.intel.com, hughd@google.com, shakeelb@google.com,
    akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <1559047464-59838-1-git-send-email-yang.shi@linux.alibaba.com>
From: Yang Shi
Message-ID: <2e23bd8c-6120-5a86-9e9e-ab43b02ce150@linux.alibaba.com>
Date: Wed, 29 May 2019 10:34:24 +0800

On 5/29/19 9:22 AM, David Rientjes wrote:
> On Tue, 28 May 2019, Yang Shi wrote:
>
>> I got some reports from our internal application team about memcg OOM.
>> Even though the application has been killed by the oom killer, there are
>> still a lot of THPs residing; page reclaim doesn't reclaim them at all.
>>
>> Some investigation shows they are on the deferred split queue, and memcg
>> direct reclaim can't shrink them since the THP deferred split shrinker is
>> not memcg aware. This may cause premature OOM in a memcg. The issue can
>> be reproduced easily by the below test:
>>
> Right, we've also encountered this. I talked to Kirill about it a week or
> so ago where the suggestion was to split all compound pages on the
> deferred split queues under the presence of even memory pressure.
>
> That breaks cgroup isolation and perhaps unfairly penalizes workloads
> that are running attached to other memcg hierarchies that are not under
> pressure, because their compound pages are now split as a side effect.
> There is a benefit to keeping these compound pages around while not under
> memory pressure if all pages are subsequently mapped again.

Yes, I do agree. I tried other approaches too; making the deferred split
queue per memcg sounds like the optimal one.

>
>> $ cgcreate -g memory:thp
>> $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
>> $ cgexec -g memory:thp ./transhuge-stress 4000
>>
>> transhuge-stress comes from the kernel selftests.
>>
>> It is easy to hit OOM, but there are still a lot of THPs on the deferred
>> split queue; memcg direct reclaim can't touch them since the deferred
>> split shrinker is not memcg aware.
>>
> Yes, we have seen this on at least 4.15 as well.
>
>> Convert the deferred split shrinker to be memcg aware by introducing a
>> per memcg deferred split queue. A THP is put on the per memcg deferred
>> split queue if it belongs to a memcg, otherwise on the per node queue.
>> When the page is migrated to another memcg, it will be moved to the
>> target memcg's deferred split queue too.
>>
>> Also, deleting the THP from the deferred split queue on page free is
>> moved to before memcg uncharge so that the page's memcg information is
>> still available.
>>
>> Reuse the second tail page's deferred_list for the per memcg list, since
>> the same THP can't be on multiple deferred split queues at the same time.
>>
>> Remove the THP-specific destructor since it is not used anymore with the
>> memcg aware THP shrinker (please see the commit log of patch 2/3 for the
>> details).
>>
>> Make the deferred split shrinker not depend on memcg kmem, since THPs
>> are not slab. It doesn't make sense to skip shrinking THPs just because
>> memcg kmem is disabled.
>>
>> With the above changes, the test demonstrated above doesn't trigger OOM
>> anymore, even with cgroup.memory=nokmem.
>>
> I'm curious if your internal applications team is also asking for
> statistics on how much memory can be freed if the deferred split queues
> can be shrunk? We have applications that monitor their own memory usage

No, but this reminds me: the THPs on the deferred split queue should be
accounted as available memory too.
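To make that concrete, the rough shape I have in mind is something like
the below (an illustrative sketch only, not the actual patches; the
struct and helper names may end up different):

/* Same triple that lives in pglist_data today, factored into a struct. */
struct deferred_split {
	spinlock_t split_queue_lock;
	struct list_head split_queue;
	unsigned long split_queue_len;
};

struct mem_cgroup {
	/* ... existing fields ... */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* per memcg deferred split queue */
	struct deferred_split deferred_split_queue;
#endif
	/* ... */
};

/*
 * Pick the queue a THP belongs to: the memcg's queue if the page is
 * charged to a memcg, otherwise the per node queue as today.  Assumes
 * the pgdat fields are wrapped into the same struct as above.
 */
static struct deferred_split *get_deferred_split_queue(struct page *page)
{
	struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
	struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));

	if (memcg)
		return &memcg->deferred_split_queue;
	return &pgdat->deferred_split_queue;
}

/* And the existing shrinker just gets registered memcg aware. */
static struct shrinker deferred_split_shrinker = {
	.count_objects	= deferred_split_count,
	.scan_objects	= deferred_split_scan,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_MEMCG_AWARE,
};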
> through memcg stats or usage and proactively try to reduce that usage
> when it is growing too large. The deferred split queues have
> significantly increased both memcg usage and rss when they've upgraded
> kernels.
>
> How are your applications monitoring how much memory from deferred split
> queues can be freed on memory pressure? Any thoughts on providing it as a
> memcg stat?

I don't think they have such monitoring. I saw that rss_huge was abnormal
in the memcg stat even after the application had been killed by oom, so I
realized the deferred split queue may play a role here.

The memcg stat doesn't have counters for available memory the way the
global vmstat does. It may be better to have such statistics, or to extend
the reclaimable "slab" notion to shrinkable/reclaimable "memory".

>
> Thanks!
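For the stat you suggested, once the queue hangs off the memcg it would be
cheap to expose, e.g. something like the below (again just a sketch; the
helper name is made up):

/*
 * Total size of the THPs sitting on this memcg's deferred split queue,
 * i.e. an upper bound on what splitting plus reclaim of the unmapped
 * subpages could free.
 */
static unsigned long memcg_deferred_split_bytes(struct mem_cgroup *memcg)
{
	struct deferred_split *ds = &memcg->deferred_split_queue;

	return ds->split_queue_len * HPAGE_PMD_NR * PAGE_SIZE;
}

That could then be printed as an extra line in memory.stat (or folded into
a broader shrinkable/reclaimable "memory" counter as mentioned above) so
that monitoring tools can treat it as reclaimable rather than plain rss.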