Date: Wed, 14 Dec 2022 18:40:19 +0100
From: Johannes Weiner
To: Michal Hocko
Cc: Dave Hansen, "Huang, Ying", Yang Shi, Wei Xu, Andrew Morton,
    linux-mm@kvack.org, LKML
Subject: Re: memcg reclaim demotion wrt. isolation

Hey Michal,

On Wed, Dec 14, 2022 at 04:29:06PM +0100, Michal Hocko wrote:
> On Wed 14-12-22 13:40:33, Johannes Weiner wrote:
> > The only way to prevent cgroups from disrupting each other on NUMA
> > nodes is NUMA constraints. Cgroup per-node limits. That shields not
> > only from demotion, but also from DoS-mbinding, or aggressive
> > promotion. All of these can result in some form of premature
> > reclaim/demotion, proactive demotion isn't special in that way.
>
> Any numa based balancing is a real challenge with memcg semantic. I do
> not see per numa node memcg limits without a major overhaul of how we do
> charging though. I am not sure this is on the table even long term.
> Unless I am really missing something here we have to live with the
> existing semantic for a foreseeable future.

Yes, I think you're quite right.

We've been mostly skirting the NUMA issue in cgroups (and to a degree
in MM code in general) with two possible answers:

a) The NUMA distances are close enough that we ignore it and pretend
   all memory is (mostly) fungible.

b) The NUMA distances are big enough that it matters, in which case
   the best option is to avoid sharing, and use bindings to keep
   workloads/containers isolated to their own CPU+memory domains.

Tiered memory forces the issue by providing memory that must be shared
between workloads/containers, but is not fungible.
At least not without incurring priority inversions between containers,
where a lopri container promotes itself to the top and demotes the
hipri workload, while staying happily within its global memory
allowance. This applies to mbind() cases as much as it does to NUMA
balancing.

If these setups proliferate, it seems inevitable to me that sooner or
later the full problem space of memory cgroups - dividing up a shared
resource while allowing overcommit - applies not just to "RAM as a
whole", but to each memory tier individually.

Whether we need the full memcg interface per tier or per node, I'm not
sure. It might be enough to automatically apportion global allowances
to nodes; so if you have 32G toptier and 16G lowtier, and a cgroup has
a 20G allowance, it gets 13G on top and 7G on low.

(That, or we settle on multi-socket systems with private tiers, such
that memory continues to be unshared :-)

Either way, I expect this issue will keep coming up as we try to use
containers on such systems.
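
For illustration only, the proportional split in the 32G/16G/20G example
above could be sketched roughly as follows. This is not an existing memcg
interface: the function name `apportion` and the tier names are
hypothetical, and the largest-remainder rounding is an assumption chosen
so that the whole-gigabyte shares add up to the allowance. The mail only
states the rounded result.

```python
def apportion(allowance, tier_capacities):
    """Split a global allowance across tiers in proportion to tier size.

    Hypothetical helper, not a kernel API. Shares are whole units; the
    units lost to integer division are handed to the tiers with the
    largest remainders (an assumed rounding rule).
    """
    total = sum(tier_capacities.values())
    if total <= 0:
        raise ValueError("total tier capacity must be positive")

    shares = {}
    remainders = []
    for name, capacity in tier_capacities.items():
        # Floor of the exact proportional share, plus what was cut off.
        quotient, remainder = divmod(allowance * capacity, total)
        shares[name] = quotient
        remainders.append((remainder, name))

    # Units still unassigned after flooring every share.
    leftover = allowance - sum(shares.values())
    for _, name in sorted(remainders, key=lambda item: -item[0])[:leftover]:
        shares[name] += 1
    return shares


# The numbers from the mail: 32G toptier, 16G lowtier, 20G allowance.
print(apportion(20, {"toptier": 32, "lowtier": 16}))
# prints {'toptier': 13, 'lowtier': 7}
```

The exact shares are 13.33G and 6.67G, so any scheme of this kind has to
pick a rounding rule. A real implementation would presumably also have to
deal with overcommit and with tiers changing size, which this sketch
ignores.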