From: Nitin Gupta
Date: Thu, 28 May 2020 17:55:22 -0700
Subject: Re: [PATCH v5] mm: Proactive compaction
To: Khalid Aziz
Cc: Nitin Gupta, Mel Gorman, Michal Hocko, Vlastimil Babka,
 Matthew Wilcox, Andrew Morton, Mike Kravetz, Joonsoo Kim,
 David Rientjes, linux-kernel, linux-mm, Linux API
References: <20200518181446.25759-1-nigupta@nvidia.com>
List-ID: linux-kernel@vger.kernel.org

On Thu, May 28, 2020 at 4:32 PM Khalid Aziz wrote:
>
> This looks good to me. I like the overall idea of controlling the
> aggressiveness of compaction with a single tunable for the whole
> system. I wonder how an end user could arrive at a reasonable value
> for it based on their workload. More comments below.
>

Tunables like the one this patch introduces, and similar ones like
'swappiness', will always require some experimentation from the user.

> On Mon, 2020-05-18 at 11:14 -0700, Nitin Gupta wrote:
> > For some applications, we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations
> > can fail if the memory is fragmented.
> > The Linux kernel currently does on-demand compaction as we request
> > more hugepages, but this style of compaction incurs very high
> > latency. Experiments with one-time full memory compaction (followed
> > by hugepage allocations) show that the kernel is able to restore a
> > highly fragmented memory state to a fairly compacted memory state
> > within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of
> > memory as hugepages while keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to
> > define a new tunable called 'proactiveness' which dictates bounds
> > for external fragmentation wrt HUGETLB_PAGE_ORDER order which
> > kcompactd tries to maintain.
> >
> > The tunable is exposed through sysctl:
> >   /proc/sys/vm/compaction_proactiveness
> >
> > It takes a value in the range [0, 100], with a default of 20.
>
> Looking at the code, setting this to 100 would mean the system would
> continuously strive to drive the level of fragmentation down to 0,
> which cannot be reasonable and would bog the system down. A cap
> lower than 100 might be a good idea to keep kcompactd from dragging
> the system down.
>

Yes, I understand that a value of 100 would cause a continuous
compaction storm, but I still don't want to artificially cap the
tunable. The interpretation of this tunable can change in the future,
and a range of [0, 100] seems more intuitive than, say, [0, 90].
Still, I think a word of caution should be added to its documentation
(admin-guide/sysctl/vm.rst).

> >
> > Total 2M hugepages allocated = 383859 (749G worth of hugepages out
> > of 762G total free => 98% of free memory could be allocated as
> > hugepages)
> >
> > - With 5.6.0-rc3 + this patch, with proactiveness=20
> >
> >   echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
>
> Should be "echo 20 | sudo tee /proc/sys/vm/compaction_proactiveness"
>

Oops... I forgot to update the patch description.
This is from the v4 patch, which used sysfs; v5 switched to sysctl.

> >
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst
> > b/Documentation/admin-guide/sysctl/vm.rst
> > index 0329a4d3fa9e..e5d88cabe980 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -119,6 +119,19 @@ all zones are compacted such that free memory is
> >  available in contiguous blocks where possible. This can be
> >  important for example in the allocation of huge pages although
> >  processes will also directly compact memory as required.
> >
> > +compaction_proactiveness
> > +========================
> > +
> > +This tunable takes a value in the range [0, 100] with a default value of
> > +20. This tunable determines how aggressively compaction is done in the
> > +background. Setting it to 0 disables proactive compaction.
> > +
> > +Note that compaction has a non-trivial system-wide impact as pages
> > +belonging to different processes are moved around, which could also lead
> > +to latency spikes in unsuspecting applications. The kernel employs
> > +various heuristics to avoid wasting CPU cycles if it detects that
> > +proactive compaction is not being effective.
> > +
>
> A value of 100 would cause kcompactd to try to bring fragmentation
> down to 0. If hugepages are being consumed and released continuously
> by the workload, it is possible that kcompactd keeps making progress
> (and hence passes the test "proactive_defer = score < prev_score ?")
> continuously but cannot reach a fragmentation score of 0, and hence
> gets stuck in compact_zone() for a long time. Page migration for
> compaction is not inexpensive. Maybe either cap the value to
> something less than 100 or set a floor for wmark_low above 0.
>
> Some more guidance regarding the value for this tunable might be
> helpful here, something along the lines of what a value of 100 means
> in terms of how kcompactd will behave.
> It can then give the end user a better idea of what they are getting
> and at what cost. You touch upon the cost above. Just add some more
> details so an end user can get a better idea of the size of the cost
> for higher values of this tunable.
>

I like the idea of capping wmark_low at, say, 5 to prevent admins
from overloading the system. Similarly, wmark_high should be capped
at, say, 95 to allow tunable values below 10 to have any effect:
currently such low tunable values would give wmark_high=100, which
would cause proactive compaction to never get triggered.

Finally, I see your concern about the lack of guidance on extreme
values of the tunable. I will address this in the next (v6)
iteration.

Thanks,
Nitin