Date: Thu, 8 Oct 2020 10:53:36 -0400
From: Johannes Weiner
To: Shakeel Butt
Cc: Michal Hocko, Roman Gushchin, Yang Shi, Greg Thelen, David Rientjes,
    Michal Koutný, Andrew Morton, Linux MM, Cgroups, LKML, Andrea Righi,
    SeongJae Park
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
Message-ID: <20201008145336.GA163830@cmpxchg.org>
References: <20200909215752.1725525-1-shakeelb@google.com>
 <20200928210216.GA378894@cmpxchg.org>
 <20200929150444.GG2277@dhcp22.suse.cz>
 <20200929215341.GA408059@cmpxchg.org>
 <20201001143149.GA493631@cmpxchg.org>
On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote:
> On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner wrote:
> [snip]
> > > > So instead of asking users for a target size whose suitability
> > > > heavily depends on the kernel's LRU implementation, the readahead
> > > > code, the IO device's capability and general load, why not directly
> > > > ask the user for a pressure level that the workload is comfortable
> > > > with and which captures all of the above factors implicitly? Then
> > > > let the kernel do this feedback loop from a per-cgroup worker.
> > >
> > > I am assuming here that by pressure level you are referring to a
> > > PSI-like interface, e.g. allowing users to say about their jobs that
> > > X amount of stalls in a fixed time window is tolerable.
> >
> > Right, essentially the same parameters that psi poll() would take.
>
> I thought a bit more about the semantics of using psi for proactive
> reclaim.
>
> Suppose I have a top-level cgroup A on which I want to enable
> proactive reclaim. Which memory psi events should the proactive
> reclaim consider?
>
> The simplest would be the memory.psi at 'A'. However, memory.psi is
> hierarchical, and I would not really want the pressure due to limits
> in children of 'A' to impact the proactive reclaim.

I don't think pressure from limits down the tree can be separated
out, generally. All events are accounted recursively as well.

Of course, we remember the reclaim level for evicted entries - but if
there is reclaim triggered at A and A/B concurrently, the distribution
of who ends up reclaiming the physical pages in A/B is pretty
arbitrary/racy.

If A/B decides to do its own proactive reclaim with the sublimit, and
ends up consuming the pressure budget assigned to proactive reclaim in
A, there isn't much that can be done. It's also possible that
proactive reclaim in A keeps A/B from hitting its limit in the first
place.

I have to say, the configuration doesn't really strike me as sensible,
though. Limits make sense for doing fixed partitioning: A gets 4G, A/B
gets 2G out of that. But if you do proactive reclaim on A, you're
essentially saying that A as a whole is auto-sizing dynamically based
on its memory access pattern. I'm not sure what it means to then start
doing fixed partitions in the sublevel.
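To make "the same parameters that psi poll() would take" above
concrete: with the existing PSI trigger interface, userspace writes a
stall threshold and a time window to memory.pressure and then sleeps
in poll(). A minimal sketch - the cgroup path and the numbers are made
up for illustration:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Wake up when tasks stall on memory >150ms in any 1s window */
	const char trig[] = "some 150000 1000000";
	struct pollfd fds;

	fds.fd = open("/sys/fs/cgroup/A/memory.pressure",
		      O_RDWR | O_NONBLOCK);
	if (fds.fd < 0)
		return 1;
	if (write(fds.fd, trig, strlen(trig) + 1) < 0)
		return 1;
	fds.events = POLLPRI;

	for (;;) {
		if (poll(&fds, 1, -1) < 0)
			return 1;
		if (fds.revents & POLLERR)
			break;	/* monitored cgroup went away */
		if (fds.revents & POLLPRI)
			printf("memory pressure threshold crossed\n");
	}
	return 0;
}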
> PSI due to refaults and slow IO should be included, or maybe only
> those which are caused by the proactive reclaim itself. I am
> undecided on the PSI due to compaction. PSI due to global reclaim for
> 'A' is even more complicated. This is a stall due to reclaiming from
> the system, including self; it might not really cause more refaults
> and IOs for 'A'. Should proactive reclaim ignore the pressure due to
> global reclaim when tuning its aggressiveness?

Yeah, I think they should all be included, because ultimately what
matters is what the workload can tolerate without sacrificing
performance. Proactive reclaim can destroy THPs, so the cost of
recreating them should be reflected. Otherwise you can easily
overpressurize.

For global reclaim, if you say you want a workload pressurized to X
percent in order to drive the LRUs and chop off all cold pages the
workload can live without, it doesn't matter who does the work. If
there is an abundance of physical memory, it's going to be proactive
reclaim. If physical memory is already tight enough that global
reclaim does it for you, there is nothing to be done in addition, and
proactive reclaim should hang back. Otherwise you can again easily
overpressurize the workload.
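To illustrate the feedback loop being discussed: a userspace daemon
could watch the cgroup's pressure averages and keep nudging reclaim
only while the workload stays below its tolerated stall level, backing
off as soon as pressure - from whatever source - reaches the budget. A
rough sketch, assuming the per-memcg reclaim file proposed in this
patch ('memory.reclaim' here) accepts a byte count; the path, budget
and chunk size are made up:

#include <stdio.h>
#include <unistd.h>

#define PRESSURE "/sys/fs/cgroup/A/memory.pressure"	/* illustrative */
#define RECLAIM  "/sys/fs/cgroup/A/memory.reclaim"	/* this patch */

/* Parse the "some avg10=" stall percentage from memory.pressure. */
static double some_avg10(void)
{
	char line[256];
	double pct = -1.0;
	FILE *f = fopen(PRESSURE, "r");

	if (!f)
		return -1.0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "some avg10=%lf", &pct) == 1)
			break;
	fclose(f);
	return pct;
}

int main(void)
{
	const double budget = 1.0;	/* tolerate ~1% some-stalls */

	for (;;) {
		double pct = some_avg10();

		/*
		 * Reclaim in small steps while pressure is below the
		 * budget; if global or limit reclaim is already
		 * generating stalls, hang back.
		 */
		if (pct >= 0.0 && pct < budget) {
			FILE *f = fopen(RECLAIM, "w");

			if (f) {
				fprintf(f, "%d", 16 << 20);	/* 16M */
				fclose(f);
			}
		}
		sleep(5);
	}
	return 0;
}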