Subject: Re: [RFC PATCH 0/3] Make deferred split shrinker memcg aware
From: Yang Shi <yang.shi@linux.alibaba.com>
To: David Rientjes
Cc: ktkhai@virtuozzo.com, hannes@cmpxchg.org, mhocko@suse.com,
    kirill.shutemov@linux.intel.com, hughd@google.com, shakeelb@google.com,
    Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Thu, 30 May 2019 11:22:21 +0800
Message-ID: <9af25d50-576a-3cc3-20a3-c0c61cf3e494@linux.alibaba.com>
References: <1559047464-59838-1-git-send-email-yang.shi@linux.alibaba.com>
 <2e23bd8c-6120-5a86-9e9e-ab43b02ce150@linux.alibaba.com>

On 5/30/19 5:07 AM, David Rientjes wrote:
> On Wed, 29 May 2019, Yang Shi wrote:
>
>>> Right, we've also encountered this. I talked to Kirill about it a week
>>> or so ago, where the suggestion was to split all compound pages on the
>>> deferred split queues in the presence of any memory pressure.
>>>
>>> That breaks cgroup isolation, and perhaps unfairly penalizes workloads
>>> that are running attached to other memcg hierarchies that are not under
>>> pressure, because their compound pages are now split as a side effect.
>>> There is a benefit to keeping these compound pages around while not
>>> under memory pressure, if all the pages are subsequently mapped again.
>>
>> Yes, I do agree. I tried other approaches too; making the deferred split
>> queue per memcg sounds like the optimal one.
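For context, the rough shape this takes with the upstream shrinker API is
sketched below. This is only an illustration: get_split_queue() and
split_queue_scan() are made-up helpers standing in for the real per-memcg
queue lookup and split loop, not code from the series.

    static unsigned long deferred_split_count(struct shrinker *shrink,
    					      struct shrink_control *sc)
    {
    	/* sc->memcg is set when memcg reclaim walks memcg-aware shrinkers */
    	struct deferred_split *queue = get_split_queue(sc->memcg, sc->nid);

    	return READ_ONCE(queue->split_queue_len);
    }

    static unsigned long deferred_split_scan(struct shrinker *shrink,
    					     struct shrink_control *sc)
    {
    	/* try to split up to sc->nr_to_scan THPs queued for this memcg */
    	return split_queue_scan(get_split_queue(sc->memcg, sc->nid),
    				sc->nr_to_scan);
    }

    static struct shrinker deferred_split_shrinker = {
    	.count_objects = deferred_split_count,
    	.scan_objects  = deferred_split_scan,
    	.seeks         = DEFAULT_SEEKS,
    	.flags         = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
    };

With SHRINKER_MEMCG_AWARE set, memcg reclaim passes the memcg under
reclaim in sc->memcg, so the splitting stays confined to the hierarchy
that is actually under pressure.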
> The approach we went with was to track the actual counts of compound
> pages on the deferred split queue for each pgdat for each memcg, and then
> invoke the shrinker for memcg reclaim and iterate those not charged to
> the hierarchy under reclaim. That's suboptimal and was a stop-gap measure
> under time pressure: it's refreshing to see the optimal method being
> pursued, thanks!

We did exactly the same thing for a temporary hotfix.

>>> I'm curious if your internal applications team is also asking for
>>> statistics on how much memory can be freed if the deferred split queues
>>> can be shrunk? We have applications that monitor their own memory usage
>>
>> No, but this reminds me: the THPs on the deferred split queue should be
>> accounted into available memory too.
>>
> Right, and we have also seen this for users of MADV_FREE that have both
> an increased rss and memcg usage and don't realize that the memory is
> freed under pressure. I'm thinking that we need some kind of MemAvailable
> for memcg hierarchies to be the authoritative source of what can be
> reclaimed under pressure.

That sounds useful. We also need to know the available memory in memcg
scope in our containers.
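Today that means computing a back-of-the-envelope estimate by hand,
something like the sketch below. This is purely illustrative: the inputs
are assumed to come from memory.stat and the memcg limit, and the
deferred-split term is exactly the stat that does not exist yet.

    /*
     * Sketch of a MemAvailable-style estimate for one memcg.  The
     * deferred_split input is hypothetical; nothing exports it today,
     * which is the gap discussed above.
     */
    static unsigned long memcg_mem_available(unsigned long limit,
    					     unsigned long usage,
    					     unsigned long inactive_file,
    					     unsigned long slab_reclaimable,
    					     unsigned long deferred_split)
    {
    	/* headroom to the limit, plus what reclaim could win back */
    	return (limit - usage) + inactive_file + slab_reclaimable
    		+ deferred_split;
    }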
>>> through memcg stats or usage and proactively try to reduce that usage
>>> when it is growing too large. The deferred split queues have
>>> significantly increased both memcg usage and rss when they've upgraded
>>> kernels.
>>>
>>> How are your applications monitoring how much memory from deferred
>>> split queues can be freed on memory pressure? Any thoughts on providing
>>> it as a memcg stat?
>>
>> I don't think they have such a monitor. I saw that rss_huge was abnormal
>> in the memcg stat even after the application was killed by the OOM
>> killer, so I realized the deferred split queue may play a role here.
>>
> Exactly the same in my case :)  We were likely looking at the exact same
> issue at the same time.

Yes, it seems so. :-)

>> The memcg stat doesn't have counters for available memory the way the
>> global vmstat does. It may be better to have such statistics, or to
>> extend reclaimable "slab" to shrinkable/reclaimable "memory".
>>
> Have you considered following how NR_ANON_MAPPED is tracked for each
> pgdat and using that as an indicator of when to modify a memcg stat to
> track the amount of memory on a compound page? I think this would be
> necessary for userspace to know what their true memory usage is.

No, I haven't. Do you mean subtracting MADV_FREE and deferred split THP
from NR_ANON_MAPPED? It looks like they have already been subtracted from
NR_ANON_MAPPED when the rmap is removed.
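For the deferred split case, the path I have in mind is roughly the
following (a simplified sketch from memory, not a verbatim quote of
mm/rmap.c):

    /*
     * Simplified sketch, not verbatim mm/rmap.c: when mappings of a
     * compound anon page are torn down, the node counter is decremented
     * right away, and a partially unmapped THP is queued for deferred
     * splitting -- so it leaves NR_ANON_MAPPED even though the
     * underlying pages are still charged to the memcg.
     */
    static void anon_compound_unmap(struct page *page, int nr)
    {
    	__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);

    	/* still partially mapped: defer the split, keep the charge */
    	if (nr < HPAGE_PMD_NR)
    		deferred_split_huge_page(page);
    }

That would square with the rss_huge observation above: the pages drop out
of NR_ANON_MAPPED at unmap time, but stay charged while they sit on the
deferred split queue.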