From: Dan Schatzberg
To: Johannes Weiner, Roman Gushchin, Yosry Ahmed, Huan Yang
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
    linux-mm@kvack.org, Tejun Heo, Zefan Li, Jonathan Corbet,
    Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton,
    David Hildenbrand, Matthew Wilcox, Kefeng Wang,
    "Vishal Moola (Oracle)", Mina Almasry, Yue Zhao, Hugh Dickins
Subject: [PATCH V3 0/1] Add swappiness argument to memory.reclaim
Date: Mon, 11 Dec 2023 06:04:14 -0800
Message-Id: <20231211140419.1298178-1-schatzberg.dan@gmail.com>

Changes since V2:
  * No functional change
  * Used int consistently rather than a pointer

Changes since V1:
  * Added documentation

This patch proposes augmenting the memory.reclaim interface with a
swappiness= argument that overrides the swappiness value for that
instance of proactive reclaim.

Userspace proactive reclaimers use the memory.reclaim interface to
trigger reclaim. The memory.reclaim interface does not provide any way
to affect the balance of file vs anon during proactive reclaim. The only
approach is to adjust the vm.swappiness setting. However, there are a
few reasons we look to control the balance of file vs anon during
proactive reclaim, separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
  situations we are fine with lots of swap-out to avoid OOMs. As these
  are typically rare events, they have relatively little impact on write
  endurance. However, proactive reclaim runs continuously and so its
  impact on SSD write endurance is more significant. Therefore it is
  desirable to control swap-out for proactive reclaim separately from
  reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
  swap exhaustion.
  This makes sense if the swap exhaustion is triggered due to reactive
  reclaim but less so if it is triggered due to proactive reclaim (e.g.
  one could see OOMs when free memory is ample but anon is just
  particularly cold). Therefore, it's desirable to have proactive
  reclaim reduce or stop swap-out before the threshold at which OOM
  killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust
vm.swappiness before writes to memory.reclaim[2]. This has been in
production for nearly two years and has addressed our needs to control
proactive vs reactive reclaim behavior but is still not ideal for a
number of reasons:

* vm.swappiness is a global setting; adjusting it can race/interfere
  with other system administration that wishes to control
  vm.swappiness. In our case, we need to disable Senpai before
  adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
  a misconfigured setting. This requires some additional management to
  record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

Previously, this exact interface addition was proposed by Yosry[3]. In
response, Roman proposed instead an interface to specify precise
file/anon/slab reclaim amounts[4]. More recently Huan proposed this as
well[5] and others similarly questioned if this was the proper
interface.

Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive
reclaim with different settings to control file vs anon reclaim (e.g.
to only reclaim anon from some applications). Responses argued that
adjusting swappiness is a poor interface for custom reclaim.

In contrast, I argue in favor of a swappiness setting not as a way to
implement custom reclaim algorithms but rather to bias the balance of
anon vs file due to differences of proactive vs reactive reclaim.
In this context, swappiness is the existing interface for controlling
this balance and this patch simply allows for it to be configured
differently for proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels
inappropriate for this purpose. Proactive reclaimers are unaware of the
relative age of file vs anon for a cgroup, which makes it difficult to
manage proactive reclaim of different memory pools. A proactive
reclaimer would need to issue some amount of anon reclaim attempts
separately from the amount of file reclaim attempts, which seems brittle
given that it's difficult to observe the impact.

[1] https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2] https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3] https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4] https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5] https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/

Dan Schatzberg (1):
  mm: add swapiness= arg to memory.reclaim

 Documentation/admin-guide/cgroup-v2.rst | 15 ++++++-
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 55 ++++++++++++++++++++-----
 mm/vmscan.c                             | 13 +++++-
 4 files changed, 70 insertions(+), 16 deletions(-)

-- 
2.34.1
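
For illustration only, here is a rough sketch of how a userspace
proactive reclaimer might drive the interface described in this cover
letter. Several things are assumptions rather than taken from the patch:
the "<bytes> swappiness=<value>" request layout, the 0..200 range
(borrowed from vm.swappiness), the /sys/fs/cgroup mount point, and the
helper names build_reclaim_request() and proactive_reclaim(), which are
hypothetical. The patch's documentation change defines the real syntax.

```python
from pathlib import Path
from typing import Optional

# Assumed cgroup v2 mount point; adjust for the local system.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def build_reclaim_request(nbytes: int, swappiness: Optional[int] = None) -> str:
    """Build the string to write to memory.reclaim (hypothetical helper).

    The "<bytes> swappiness=<value>" layout is an assumption based on the
    cover letter, not a confirmed format.
    """
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    if swappiness is None:
        # No override: this reclaim request carries only the byte count.
        return str(nbytes)
    # Assumed to share the 0..200 range of vm.swappiness.
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be between 0 and 200")
    return f"{nbytes} swappiness={swappiness}"


def proactive_reclaim(cgroup: str, nbytes: int,
                      swappiness: Optional[int] = None) -> None:
    """Issue one proactive reclaim request against a cgroup.

    The override applies to this request only, so nothing global has to be
    saved and restored if the reclaimer crashes or restarts. A failed write
    (for example, a kernel without this patch rejecting the argument)
    surfaces as OSError.
    """
    path = CGROUP_ROOT / cgroup / "memory.reclaim"
    path.write_text(build_reclaim_request(nbytes, swappiness))


if __name__ == "__main__":
    # Show the request a reclaimer would send to reclaim 1 GiB with
    # swap-out disabled for this request.
    print(build_reclaim_request(1 << 30, swappiness=0))
```

Compared with adjusting vm.swappiness around each write, a per-request
argument like this keeps the setting local to the reclaimer and leaves
the global value untouched for reactive reclaim.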