From: Shakeel Butt
Date: Thu, 8 Oct 2020 08:55:57 -0700
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
To: Johannes Weiner
Cc: Michal Hocko, Roman Gushchin, Yang Shi, Greg Thelen, David Rientjes, Michal Koutný, Andrew Morton, Linux MM, Cgroups, LKML, Andrea Righi, SeongJae Park
In-Reply-To: <20201008145336.GA163830@cmpxchg.org>
References: <20200909215752.1725525-1-shakeelb@google.com> <20200928210216.GA378894@cmpxchg.org> <20200929150444.GG2277@dhcp22.suse.cz> <20200929215341.GA408059@cmpxchg.org> <20201001143149.GA493631@cmpxchg.org> <20201008145336.GA163830@cmpxchg.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Oct 8, 2020 at 7:55 AM Johannes Weiner wrote:
>
> On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote:
> > On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner wrote:
> > >
> > [snip]
> > > > > So instead of asking users for a target size whose suitability
> > > > > heavily depends on the kernel's LRU implementation, the readahead
> > > > > code, the IO device's capability and general load, why not directly
> > > > > ask the user for a pressure level that the workload is comfortable
> > > > > with and which captures all of the above factors implicitly? Then
> > > > > let the kernel do this feedback loop from a per-cgroup worker.
> > > >
> > > > I am assuming here that by pressure level you are referring to the
> > > > PSI-like interface, e.g. allowing the users to tell about their jobs
> > > > that X amount of stalls in a fixed time window is tolerable.
> > >
> > > Right, essentially the same parameters that psi poll() would take.
> >
> > I thought a bit more on the semantics of the psi usage for the
> > proactive reclaim.
> >
> > Suppose I have a top-level cgroup A on which I want to enable
> > proactive reclaim. Which memory psi events should the proactive
> > reclaim consider?
> >
> > The simplest would be the memory.psi at 'A'. However memory.psi is
> > hierarchical and I would not really want the pressure due to limits in
> > children of 'A' to impact the proactive reclaim.
>
> I don't think pressure from limits down the tree can be separated out,
> generally. All events are accounted recursively as well. Of course, we
> remember the reclaim level for evicted entries - but if there is
> reclaim triggered at A and A/B concurrently, the distribution of who
> ends up reclaiming the physical pages in A/B is pretty arbitrary/racy.
>
> If A/B decides to do its own proactive reclaim with the sublimit, and
> ends up consuming the pressure budget assigned to proactive reclaim in
> A, there isn't much that can be done.
>
> It's also possible that proactive reclaim in A keeps A/B from hitting
> its limit in the first place.
>
> I have to say, the configuration doesn't really strike me as sensible,
> though. Limits make sense for doing fixed partitioning: A gets 4G, A/B
> gets 2G out of that. But if you do proactive reclaim on A you're
> essentially saying A as a whole is auto-sizing dynamically based on
> its memory access pattern. I'm not sure what it means to then start
> doing fixed partitions in the sublevel.
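[For readers following along: the parameters discussed above, a tolerable stall time within a fixed window, are exactly what the kernel's PSI trigger interface takes. A userspace agent writes the trigger to a cgroup's memory.pressure file and then poll()s for POLLPRI. A minimal sketch; the cgroup path and thresholds here are illustrative, not from the patch:]

```python
import os
import select

def psi_trigger(kind, stall_us, window_us):
    # Build the trigger string the kernel expects, e.g. "some 150000 1000000":
    # notify when total stall time exceeds stall_us within any window_us window.
    assert kind in ("some", "full")
    return f"{kind} {stall_us} {window_us}\0".encode()

def wait_for_pressure(pressure_path, stall_us=150_000, window_us=1_000_000):
    # Register the trigger on a cgroup's memory.pressure file, then block
    # until the kernel signals (POLLPRI) that the threshold was crossed.
    fd = os.open(pressure_path, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, psi_trigger("some", stall_us, window_us))
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        return poller.poll()  # blocks until a stall event fires
    finally:
        os.close(fd)

# Usage (requires a real cgroup2 mount, hypothetical cgroup "A"):
#   wait_for_pressure("/sys/fs/cgroup/A/memory.pressure")
```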
>

Think of the scenario where there is an infrastructure owner and a large number of job owners. The aim of the infra owner is to reduce cost by stuffing as many jobs as possible on the same machine, while the job owners want consistent performance. The job owners usually have meta jobs, i.e. a set of small jobs that run on the same machines, and they manage these sub-jobs themselves. The infra owner wants to do proactive reclaim to trim the current jobs without impacting their performance and, more importantly, to have enough memory to land new jobs (we have learned the hard way that depending on global reclaim for memory overcommit is really bad for isolation).

In the above scenario, the configuration you said might not be sensible is really possible. This is exactly what we have in prod. You can also see why I am asking for flexibility regarding the cost of proactive reclaim.
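[To make the discussion concrete, here is a rough sketch of the kind of per-cgroup proactive reclaim worker being discussed: it drives the memory.reclaim interface proposed in this patch off the cgroup's own memory.pressure readings, backing off once the job's tolerated pressure is exceeded. The tolerance, step size, and polling interval are illustrative assumptions, not part of the patch:]

```python
import time

def parse_pressure(text):
    # Parse memory.pressure output, e.g.
    #   some avg10=0.50 avg60=0.30 avg300=0.10 total=123456
    # into {"some": 0.50, "full": ...} using the avg10 column.
    out = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)
        fields = dict(f.split("=") for f in rest.split())
        out[kind] = float(fields["avg10"])
    return out

def reclaim_step(pressure_some, tolerance_pct, step_bytes):
    # Feedback rule: keep trimming by a fixed step while pressure stays
    # below the job's tolerance; back off (reclaim nothing) otherwise.
    return 0 if pressure_some > tolerance_pct else step_bytes

def proactive_reclaim(cgroup, tolerance_pct=1.0, step_bytes=64 << 20):
    # Worker loop for one cgroup: read its own (non-hierarchy-separable)
    # pressure, then ask the kernel to reclaim step_bytes via the
    # proposed memory.reclaim file.
    while True:
        with open(f"{cgroup}/memory.pressure") as f:
            some = parse_pressure(f.read())["some"]
        step = reclaim_step(some, tolerance_pct, step_bytes)
        if step:
            with open(f"{cgroup}/memory.reclaim", "w") as f:
                f.write(str(step))
        time.sleep(10)
```

As the thread notes, running such workers concurrently at A and A/B makes the attribution of reclaimed pages arbitrary; the sketch sidesteps none of that, it only shows the shape of the feedback loop.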