Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp2772104rwb; Fri, 2 Dec 2022 15:02:37 -0800 (PST) X-Google-Smtp-Source: AA0mqf4RePDXCeHhc0KF+xKuO1rPQfAdT065PGdxj8J5jbKg1mXfYj9b50bnPDwALaGNKBQlUZOJ X-Received: by 2002:a17:906:3b88:b0:7ae:83:5be7 with SMTP id u8-20020a1709063b8800b007ae00835be7mr59616430ejf.139.1670022157515; Fri, 02 Dec 2022 15:02:37 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670022157; cv=none; d=google.com; s=arc-20160816; b=xug3k2oZ/uLpkV6My0Y6DkCh7RfcwK37R9C+KIbvXNvtVLvHc8Y4Ukl/pVX26xHECY 84X66C1wcubc9O4NTXjBj22nOWZFtR0ASIZOuuZjPv+lyBPb7pxgPDH8eB03Ms41/+6N YCHRjAgJol3JTLCJ/5zMX5a9Ewf1XmsGjL8nzx9wh0EeiWurZFsIRwNedCWAidbV49fW AYPK/yA+9SF75NGbLUnRYZvUW5/u3qp4nXFKtSJENhzrPmDYDBKxwiBYg6SfdjcDuUEf GkeORLEhwmI0QpOYHWrL9yIQWQTRlgjgnTsqqP7ycQf0fWeZlEqww3mW/XmWQMGZ++BD Tmkg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:message-id:mime-version:date :dkim-signature; bh=+RO437GEZ+WElFZehPW6Ok02BB1y3djf+6Wg2yQ+h9s=; b=FMg6dIjRlpmmokouz84Sbp8hWCe5ApY68Vvj0hAZ1OGHmF3UvQWPW/MzA3/VlLATMH PEloDtEIJFQtQa5KV4Xp67ma3AeL37wTp3Rgoabgwp6rWVhqBs/qDiOZtgyRYwbVSObi ip2kztsnRLqqaM/tm80l17wjLT22Iw5jN8ShBtie1hyIqlG1LvrIiiwmCqDvbrjTPRfZ FEhcNqOiEsa9QnjpYZNaPJ8G/9xl1O9X8JQDV8CODg2DDWp0GKC8KgiPWvQlWAFUNkBw L7CoyqRBrsSh+myVR/e5XZq4BEWLXnq7FvDmJFIeByT1ykuB8gaI0zurp5Lod6O4Vha2 CinA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=LHn2nRg2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id sg12-20020a170907a40c00b007918f58458bsi7521686ejc.441.2022.12.02.15.02.18; Fri, 02 Dec 2022 15:02:37 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=LHn2nRg2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234739AbiLBWhU (ORCPT + 82 others); Fri, 2 Dec 2022 17:37:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60748 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234753AbiLBWhP (ORCPT ); Fri, 2 Dec 2022 17:37:15 -0500 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8CB01EBCB2 for ; Fri, 2 Dec 2022 14:37:14 -0800 (PST) Received: by mail-yb1-xb49.google.com with SMTP id y6-20020a25b9c6000000b006c1c6161716so6522838ybj.8 for ; Fri, 02 Dec 2022 14:37:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:from:subject:message-id:mime-version:date:from:to:cc:subject :date:message-id:reply-to; bh=+RO437GEZ+WElFZehPW6Ok02BB1y3djf+6Wg2yQ+h9s=; b=LHn2nRg2oFSaB0dJIO0yooCpv3dQ9QXMByKCwdl+egE/WOxwP4FVQGWDYz6Y9tkPE5 Jgp0CK7S4yTl0+kb59ClT1OaReoxyzdt+FA3VldMFUb3CnSkvhWbbGUhX/Xy35QjivbK TO20TuKzlLUlQOLWrz0krvOhwUBh7koeYBCRszO+0ZfI78AbvD3Oqx3YxjNGYNKY5M87 9/Vr//ig9WyTbABAr+ryFE+y1QKpod6J16FyKzolUGaMMhBukvF/UQe6iw3yqO5cm5hD Dlel/jBqm1zRF7mqaUmelu3Yt6iL0EO5IjYDyTYElc9xM5l/JhwKJVbiEHqf4k/4b4Gn ursQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:from:subject:message-id:mime-version:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=+RO437GEZ+WElFZehPW6Ok02BB1y3djf+6Wg2yQ+h9s=; b=yvgtew7z6amtNEPaE4/SCFBKax/QRgIRCmBNAD3QNoAA/F498bTxcThxicdlnfhj8p N/T6xKIO4R2M4Vt6pepi2N0ioQ8vaYCmCwkirKRV1+AsXsn5HFBev8YT+k+41quCR2Sk AKD5mEiDwLSBrd1gmdY3yQ8L43x5d+8CMyH5YbxTV08/R1chyaqNg7jGBkN8EJ0pVZOF XfhxnD7yz3ZHyU8VavrBGisGmUjYPYJkK3kCqe5MCEdpu4KbCuCDw/O+1uTsifBPXkA4 e9gbQj2+OJ4VoCtz3REgY/euaexFFWTxBzLNjZE9cUkOE5P5jOvseH1jwp1IfkyfQmY8 9pIw== X-Gm-Message-State: ANoB5pnx9Kq1pcnN2ckCDuSAi7bDMtpp8edUxvQEJEzyRITUQjiSN9Vs KK5nzfToMZ5UPDRV+B8Ui7jcjlm+u88Ma9GxbA== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2d4:203:e655:31e2:2ad4:2421]) (user=almasrymina job=sendgmr) by 2002:a81:fe09:0:b0:3b0:b9aa:46eb with SMTP id j9-20020a81fe09000000b003b0b9aa46ebmr44197742ywn.150.1670020633813; Fri, 02 Dec 2022 14:37:13 -0800 (PST) Date: Fri, 2 Dec 2022 14:35:31 -0800 Mime-Version: 1.0 X-Mailer: git-send-email 2.39.0.rc0.267.gcb52ba06e7-goog Message-ID: <20221202223533.1785418-1-almasrymina@google.com> Subject: [PATCH v3] mm: Add nodes= arg to memory.reclaim From: Mina Almasry To: Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Andrew Morton Cc: Huang Ying , Yang Shi , Yosry Ahmed , weixugc@google.com, fvdl@google.com, Mina Almasry , Michal Hocko , bagasdotme@gmail.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-9.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The nodes= arg instructs the kernel to only scan the given nodes for proactive reclaim. For example use cases, consider a 2 tier memory system: nodes 0,1 -> top tier nodes 2,3 -> second tier $ echo "1m nodes=0" > memory.reclaim This instructs the kernel to attempt to reclaim 1m memory from node 0. Since node 0 is a top tier node, demotion will be attempted first. This is useful to direct proactive reclaim to specific nodes that are under pressure. $ echo "1m nodes=2,3" > memory.reclaim This instructs the kernel to attempt to reclaim 1m memory in the second tier, since this tier of memory has no demotion targets the memory will be reclaimed. $ echo "1m nodes=0,1" > memory.reclaim Instructs the kernel to reclaim memory from the top tier nodes, which can be desirable according to the userspace policy if there is pressure on the top tiers. Since these nodes have demotion targets, the kernel will attempt demotion first. Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), the proactive reclaim interface memory.reclaim does both reclaim and demotion. Reclaim and demotion incur different latency costs to the jobs in the cgroup. Demoted memory would still be addressable by the userspace at a higher latency, but reclaimed memory would need to incur a pagefault. The 'nodes' arg is useful to allow the userspace to control demotion and reclaim independently according to its policy: if the memory.reclaim is called on a node with demotion targets, it will attempt demotion first; if it is called on a node without demotion targets, it will only attempt reclaim. Acked-by: Michal Hocko Signed-off-by: Mina Almasry --- v3: - Dropped RFC tag from subject. - Added Michal's Ack - Applied some of Bagas's comment suggestions. - Converted try_to_fre_mem_cgorup_pages() to take nodemask_t* instead of nodemask_t as Shakeel and Muchun suggested. Cc: bagasdotme@gmail.com Thanks for the comments and reviews. --- Documentation/admin-guide/cgroup-v2.rst | 15 +++--- include/linux/swap.h | 3 +- mm/memcontrol.c | 67 ++++++++++++++++++++----- mm/vmscan.c | 4 +- 4 files changed, 68 insertions(+), 21 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 74cec76be9f2..c8ae7c897f14 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported. + This file accepts a string which contains the number of bytes to + reclaim. Example:: echo "1G" > memory.reclaim - The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). - Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. + This file also allows the user to specify the nodes to reclaim from, + via the 'nodes=' key, for example:: + + echo "1G nodes=0,1" > memory.reclaim + + The above instructs the kernel to reclaim memory from nodes 0,1. + memory.peak A read-only single value file which exists on non-root cgroups. diff --git a/include/linux/swap.h b/include/linux/swap.h index 0ceed49516ad..2787b84eaf12 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -418,7 +418,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options); + unsigned int reclaim_options, + nodemask_t *nodemask); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 23750cec0036..0f02f47a87e4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -63,6 +63,7 @@ #include #include #include +#include #include "internal.h" #include #include @@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, - MEMCG_RECLAIM_MAY_SWAP); + MEMCG_RECLAIM_MAY_SWAP, + NULL); psi_memstall_leave(&pflags); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, reclaim_options); + gfp_mask, reclaim_options, + NULL); psi_memstall_leave(&pflags); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) @@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, } if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) { + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, + NULL)) { ret = -EBUSY; break; } @@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) return -EINTR; if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, - MEMCG_RECLAIM_MAY_SWAP)) + MEMCG_RECLAIM_MAY_SWAP, + NULL)) nr_retries--; } @@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, } reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP); + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, + NULL); if (!reclaimed && !nr_retries--) break; @@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP)) + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, + NULL)) nr_reclaims--; continue; } @@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, return nbytes; } +enum { + MEMORY_RECLAIM_NODES = 0, + MEMORY_RECLAIM_NULL, +}; + +static const match_table_t if_tokens = { + { MEMORY_RECLAIM_NODES, "nodes=%s" }, + { MEMORY_RECLAIM_NULL, NULL }, +}; + static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; - unsigned int reclaim_options; - int err; + unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP | + MEMCG_RECLAIM_PROACTIVE; + char *old_buf, *start; + substring_t args[MAX_OPT_ARGS]; + int token; + char value[256]; + nodemask_t nodemask = NODE_MASK_ALL; buf = strstrip(buf); - err = page_counter_memparse(buf, "", &nr_to_reclaim); - if (err) - return err; - reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE; + old_buf = buf; + nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE; + if (buf == old_buf) + return -EINVAL; + + buf = strstrip(buf); + + while ((start = strsep(&buf, " ")) != NULL) { + if (!strlen(start)) + continue; + token = match_token(start, if_tokens, args); + match_strlcpy(value, args, sizeof(value)); + switch (token) { + case MEMORY_RECLAIM_NODES: + if (nodelist_parse(value, nodemask) < 0) + return -EINVAL; + break; + default: + return -EINVAL; + } + } + while (nr_reclaimed < nr_to_reclaim) { unsigned long reclaimed; @@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_to_reclaim - nr_reclaimed, - GFP_KERNEL, reclaim_options); + GFP_KERNEL, reclaim_options, + &nodemask); if (!reclaimed && !nr_retries--) return -EAGAIN; diff --git a/mm/vmscan.c b/mm/vmscan.c index 7b8e8e43806b..62b0c9b46bd2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - unsigned int reclaim_options) + unsigned int reclaim_options, + nodemask_t *nodemask) { unsigned long nr_reclaimed; unsigned int noreclaim_flag; @@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .may_unmap = 1, .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), + .nodemask = nodemask, }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put -- 2.39.0.rc0.267.gcb52ba06e7-goog