Date: Tue, 3 Oct 2023 11:59:31 -0400
From: Johannes Weiner
To: Michal Hocko
Cc: Nhat Pham, akpm@linux-foundation.org, riel@surriel.com,
	roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev,
	tj@kernel.org, lizefan.x@bytedance.com, shuah@kernel.org,
	mike.kravetz@oracle.com, yosryahmed@google.com, fvdl@google.com,
	linux-mm@kvack.org, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH v3 2/3] hugetlb: memcg: account hugetlb-backed memory in
	memory controller
Message-ID: <20231003155931.GF17012@cmpxchg.org>
References: <20231003001828.2554080-1-nphamcs@gmail.com>
	<20231003001828.2554080-3-nphamcs@gmail.com>

On Tue, Oct 03, 2023 at 02:58:58PM +0200, Michal Hocko wrote:
> On Mon 02-10-23 17:18:27, Nhat Pham wrote:
> > Currently, hugetlb memory usage is not accounted for in the memory
> > controller, which could lead to memory overprotection for cgroups
> > with hugetlb-backed memory. This has been observed in our production
> > system.
> >
> > For instance, here is one of our usecases: suppose there are two 32G
> > containers. The machine is booted with hugetlb_cma=6G, and each
> > container may or may not use up to 3 gigantic pages, depending on
> > the workload within it. The rest is anon, cache, slab, etc. We can
> > set the hugetlb cgroup limit of each cgroup to 3G to enforce hugetlb
> > fairness. But it is very difficult to configure memory.max to keep
> > overall consumption, including anon, cache, slab, etc., fair.
> >
> > What we have had to resort to is to constantly poll hugetlb usage
> > and readjust memory.max. A similar procedure is done for other
> > memory limits (e.g. memory.low). However, this is rather cumbersome
> > and buggy.
>
> Could you expand some more on how this _helps_ memory.low? The hugetlb
> memory is not reclaimable, so whatever portion of a memcg's
> consumption it makes up will be "protected from the reclaim". Consider
> this:
>
>           parent
>          /      \
>         A        B
>     low=50%     low=0
>  current=40%  current=60%
>
> We have external memory pressure, and the reclaim should prefer B, as
> A is under its low limit, correct? But now consider that the
> predominant consumption of B is hugetlb, which would mean the memory
> reclaim cannot do much for B, and so A's protection might be breached.
>
> As an admin (or a tool) you need to know about hugetlb as a potential
> contributor to this behavior (sure, mlocked memory would behave the
> same, but mlock rarely consumes huge amounts of memory in my
> experience). Without the accounting there might not be any external
> pressure in the first place.
>
> All that being said, I do not see how adding hugetlb into accounting
> makes low/min limit management any easier.
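(For concreteness, the two-container setup in the quoted changelog maps
to roughly the following cgroup v2 configuration - a sketch, where the
/sys/fs/cgroup paths, the ctr1/ctr2 names, and the 1GB page size are
illustrative assumptions, not from the patch:

    # Kernel booted with hugetlb_cma=6G. Cap each container's
    # gigantic page usage at 3G to enforce hugetlb fairness:
    echo 3G > /sys/fs/cgroup/ctr1/hugetlb.1GB.max
    echo 3G > /sys/fs/cgroup/ctr2/hugetlb.1GB.max

    # Hard-cap everything the memory controller accounts (anon,
    # cache, slab, ...). Without hugetlb accounting, no single value
    # here keeps overall consumption fair whether a container ends up
    # using zero or three gigantic pages:
    echo 32G > /sys/fs/cgroup/ctr1/memory.max
    echo 32G > /sys/fs/cgroup/ctr2/memory.max
)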
It's important to differentiate the cgroup usecases. One is of course
the cloud/virtual server scenario, where you set the hard limits to
whatever the customer paid for, and don't know and don't care about
the workload running inside. In that case, memory.low and overcommit
aren't really safe to begin with, due to unknown unreclaimable memory.

The other common usecase is the datacenter where you run your own
applications. You understand their workingsets and requirements, and
configure and overcommit the containers in a way where jobs always
meet their SLAs. E.g. if multiple containers spike, memory.low is set
such that interactive workloads are prioritized over batch jobs, and
both have priority over routine system management tasks. This is
arguably the only case where it's safe to use memory.low: you have to
know what's reclaimable and what isn't, otherwise you cannot know that
memory.low will even do anything, and isolation breaks down.

So we already have that knowledge: mlocked sections, how much anon is
without swap space, and how much memory must not be reclaimed (even if
it is reclaimable) for the workload to meet its SLAs. Hugetlb doesn't
really complicate this equation - we already have to consider it
unreclaimable workingset from an overcommit POV on those hosts.

The reason this patch helps in this scenario is that the service teams
are usually different from the containers/infra team. The service team
understands its workload and declares its workingset. But it's the
infra team running the containers that currently has to go and find
out if the services are using hugetlb and tweak the cgroups. Bugs and
untimeliness in the tweaking have caused multiple production incidents
already. And both teams are regularly confused when there are large
parts of the workload that don't show up in memory.current, which both
sides monitor.

Keep in mind that these systems are already pretty complex, with
multiple overcommitted containers and system-level activity. The
current hugetlb quirk can heavily distort what a given container is
doing on the host. With this patch, the service can declare its
workingset, the container team can configure the container, and
memory.current makes sense to everybody.

The workload parameters are pretty reliable. But if the service team
gets it wrong and we underprotect the workload, and/or its
unreclaimable memory exceeds what was declared, the infra team gets
alarms on elevated LOW breaching events and investigates whether it's
an infra problem or a service spec problem that needs escalation.

So the case you describe above only happens when mistakes are made,
and we detect and rectify them. In the common case, hugetlb is part of
the recognized workingset, and we configure memory.low to cut off only
known optional and reclaimable memory under pressure.
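For reference, the "poll and readjust" workaround mentioned in the
changelog amounts to something like this - a minimal sketch, assuming
cgroup v2, a 1GB hugepage size, and the same hypothetical ctr1
container as above; the real tooling is more involved:

    #!/bin/sh
    # Shrink memory.max by however much hugetlb the container uses,
    # since the memory controller can't see it. Inherently racy:
    # hugetlb usage can change between the read and the write, which
    # is part of why this approach is cumbersome and buggy.
    CG=/sys/fs/cgroup/ctr1
    BUDGET=$((32 << 30))              # 32G overall container budget
    while sleep 10; do
        used=$(cat "$CG/hugetlb.1GB.current")
        echo $((BUDGET - used)) > "$CG/memory.max"
    done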
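And the alarm side described above boils down to watching the "low"
counter in memory.events, which increments whenever reclaim is forced
into a cgroup that is still below its memory.low (the path is again a
hypothetical example):

    # memory.events generates a notification on change, so a monitor
    # can watch it with inotify/poll rather than busy-reading:
    grep '^low ' /sys/fs/cgroup/ctr1/memory.events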