Date: Mon, 16 Sep 2019 13:16:35 -0700 (PDT)
From: David Rientjes
To: Nitin Gupta
cc: akpm@linux-foundation.org, vbabka@suse.cz, mgorman@techsingularity.net,
    mhocko@suse.com, dan.j.williams@intel.com, Yu Zhao, Matthew Wilcox,
    Qian Cai, Andrey Ryabinin, Roman Gushchin, Greg Kroah-Hartman,
    Kees Cook, Jann Horn, Johannes Weiner, Arun KS, Janne Huttunen,
    Konstantin Khlebnikov, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC] mm: Proactive compaction
In-Reply-To: <20190816214413.15006-1-nigupta@nvidia.com>

On Fri, 16 Aug 2019, Nitin Gupta wrote:

> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher order allocations can
> fail if the memory is fragmented.
> Linux kernel currently does on-demand compaction as we request more
> hugepages but this style of compaction incurs very high latency.
> Experiments with one-time full memory compaction (followed by hugepage
> allocations) show that the kernel is able to restore a highly fragmented
> memory state to a fairly compacted memory state within <1 sec for a 32G
> system. Such data suggests that a more proactive compaction can help us
> allocate a large fraction of memory as hugepages while keeping
> allocation latencies low.
>
> For a more proactive compaction, the approach taken here is to define
> per page-order external fragmentation thresholds and let kcompactd
> threads act on these thresholds.
>
> The low and high thresholds are defined per page-order and exposed
> through sysfs:
>
>   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
>
> Per-node kcompactd thread is woken up every few seconds to check if
> any zone on its node has extfrag above the extfrag_high threshold for
> any order, in which case the thread starts compaction in the background
> till all zones are below extfrag_low level for all orders. By default
> both these thresholds are set to 100 for all orders which essentially
> disables kcompactd.
>
> To avoid wasting CPU cycles when compaction cannot help, such as when
> memory is full, we check both, extfrag > extfrag_high and
> compaction_suitable(zone). This allows the kcompactd thread to stay
> inactive even if extfrag thresholds are not met.
>
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
>
> Testing done (on x86):
> - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
>   respectively.
> - Use a test program to fragment memory: the program allocates all memory
>   and then for each 2M aligned section, frees 3/4 of base pages using
>   munmap.
> - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
>   compaction till extfrag < extfrag_low for order-9.
>
> The patch has plenty of rough edges but posting it early to see if I'm
> going in the right direction and to get some early feedback.
>

Is there an update to this proposal, or has a non-RFC patch been posted
for proactive compaction?

We've had good success with compacting memory on a regular cadence on
systems with hugepages enabled. The cadence itself is defined by the
admin, and it causes khugepaged[*] to periodically wake up and invoke
compaction in an attempt to keep zones as defragmented as possible. This
is perhaps more "proactive" than what is proposed here, since it tries to
keep all memory as unfragmented as possible regardless of extfrag
thresholds.

It also avoids corner cases where kcompactd could become more expensive
than anticipated because it is unsuccessful at compacting memory yet the
extfrag threshold is still exceeded.

 [*] Khugepaged instead of kcompactd only because this is enabled solely
     on systems where transparent hugepages are enabled. It is probably
     better off in kcompactd, to avoid duplicating work between two
     kthreads if there is already a need for background compaction.
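
For illustration only, the two wakeup policies discussed in this thread
can be contrasted with a minimal Python sketch. This is not code from the
RFC patch or from any kernel implementation: ZoneState, threshold_policy
and periodic_policy are hypothetical names, the `suitable` field merely
stands in for the result of compaction_suitable(zone), and extfrag is
assumed to be a 0..100 value for a single page order.

```python
from dataclasses import dataclass


@dataclass
class ZoneState:
    """Hypothetical snapshot of one zone for one page order."""
    extfrag: int    # external fragmentation, assumed range 0..100
    suitable: bool  # stand-in for compaction_suitable(zone)


def threshold_policy(zone: ZoneState, low: int, high: int,
                     compacting: bool) -> bool:
    """Hysteresis as described in the quoted RFC.

    Start compacting once extfrag rises above `high`; once started,
    keep going until extfrag drops below `low`. Never compact when
    compaction cannot help (zone not suitable).
    """
    if not zone.suitable:
        return False
    if compacting:
        return zone.extfrag >= low
    return zone.extfrag > high


def periodic_policy(now: float, last_run: float, period: float) -> bool:
    """Admin-defined cadence: compact whenever `period` seconds have
    elapsed since the last run, regardless of extfrag."""
    return now - last_run >= period


if __name__ == "__main__":
    # order-9 thresholds from the quoted test setup: low=25, high=30
    zone = ZoneState(extfrag=35, suitable=True)
    print(threshold_policy(zone, low=25, high=30, compacting=False))
    print(periodic_policy(now=120.0, last_run=60.0, period=60.0))
```

With both thresholds at the default of 100, threshold_policy never starts,
which matches the "essentially disables kcompactd" behaviour in the quoted
description. The sketch also shows the corner case mentioned above: if
compaction makes no progress, extfrag stays at or above `low` and the
threshold policy keeps asking for more work, whereas the periodic policy
is bounded by its cadence.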