Date: Tue, 22 Nov 2022 17:26:04 -0800
From: Roman Gushchin
To: Yosry Ahmed
Cc: Shakeel Butt, Johannes Weiner, Michal Hocko, Yu Zhao, Muchun Song,
    "Matthew Wilcox (Oracle)", Vasily Averin, Vlastimil Babka, Chris Down,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: memcg: fix stale protection of reclaim target memcg
References: <20221122232721.2306102-1-yosryahmed@google.com>

On Tue, Nov 22, 2022 at 04:49:54PM -0800, Yosry Ahmed wrote:
> On Tue, Nov 22, 2022 at 4:45 PM Yosry Ahmed wrote:
> >
> > On Tue, Nov 22, 2022 at 4:37 PM Roman Gushchin wrote:
> > >
> > > On Tue, Nov 22, 2022 at 11:27:21PM +0000, Yosry Ahmed wrote:
> > > > During reclaim, mem_cgroup_calculate_protection() is used to determine
> > > > the effective protection (emin and elow) values of a memcg. The
> > > > protection of the reclaim target is ignored, but we cannot set their
> > > > effective protection to 0 due to a limitation of the current
> > > > implementation (see comment in mem_cgroup_protection()). Instead,
> > > > we leave their effective protection values unchanged, and later ignore it
> > > > in mem_cgroup_protection().
> > > >
> > > > However, mem_cgroup_protection() is called later in
> > > > shrink_lruvec()->get_scan_count(), which is after the
> > > > mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a
> > > > result, the stale effective protection values of the target memcg may
> > > > lead us to skip reclaiming from the target memcg entirely, before
> > > > calling shrink_lruvec().
> > > > This can be even worse with recursive
> > > > protection, where the stale target memcg protection can be higher than
> > > > its standalone protection.
> > > >
> > > > An example where this can happen is as follows. Consider the following
> > > > hierarchy with memory_recursiveprot:
> > > > ROOT
> > > >  |
> > > >  A (memory.min = 50M)
> > > >  |
> > > >  B (memory.min = 10M, memory.high = 40M)
> > > >
> > > > Consider the following scenario:
> > > > - B has memory.current = 35M.
> > > > - The system undergoes global reclaim (target memcg is NULL).
> > > > - B will have an effective min of 50M (all of A's unclaimed protection).
> > > > - B will not be reclaimed from.
> > > > - Now allocate 10M more memory in B, pushing it above its high limit.
> > > > - The system undergoes memcg reclaim from B (target memcg is B)
> > > > - In shrink_node_memcgs(), we call mem_cgroup_calculate_protection(),
> > > >   which immediately returns for B without doing anything, as B is the
> > > >   target memcg, relying on mem_cgroup_protection() to ignore B's stale
> > > >   effective min (still 50M).
> > > > - Directly after mem_cgroup_calculate_protection(), we will call
> > > >   mem_cgroup_below_min(), which will read the stale effective min for B
> > > >   and skip it (instead of ignoring its protection as intended). In this
> > > >   case, it's really bad because we are not just considering B's
> > > >   standalone protection (10M), but we are reading a much higher stale
> > > >   protection (50M) which will cause us to not reclaim from B at all.
> > > >
> > > > This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple
> > > > e{low,min} state mutations from protection checks") which made
> > > > mem_cgroup_calculate_protection() only change the state without
> > > > returning any value. Before that commit, we used to return
> > > > MEMCG_PROT_NONE for the target memcg, which would cause us to skip the
> > > > mem_cgroup_below_{min/low}() checks.
> > > > After that commit we do not return
> > > > anything and we end up checking the min & low effective protections for
> > > > the target memcg, which are stale.
> > > >
> > > > Add mem_cgroup_ignore_protection() that checks if we are reclaiming from
> > > > the target memcg, and call it in mem_cgroup_below_{min/low}() to ignore
> > > > the stale protection of the target memcg.
> > > >
> > > > Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
> > > > Signed-off-by: Yosry Ahmed
> > >
> > > Great catch!
> > > The fix looks good to me, only a couple of cosmetic suggestions.
> > >
> > > > ---
> > > >  include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
> > > >  mm/vmscan.c                | 11 ++++++-----
> > > >  2 files changed, 33 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > index e1644a24009c..22c9c9f9c6b1 100644
> > > > --- a/include/linux/memcontrol.h
> > > > +++ b/include/linux/memcontrol.h
> > > > @@ -625,18 +625,32 @@ static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
> > > >
> > > >  }
> > > >
> > > > -static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
> > > > +static inline bool mem_cgroup_ignore_protection(struct mem_cgroup *target,
> > > > +                                                struct mem_cgroup *memcg)
> > > >  {
> > > > -       if (!mem_cgroup_supports_protection(memcg))
> > >
> > > How about merging mem_cgroup_supports_protection() and your new helper into
> > > something like mem_cgroup_possibly_protected()? It seems they are never used
> > > separately and are unlikely ever to be.
> >
> > Sounds good! I am thinking maybe mem_cgroup_no_protection() which is
> > an inlining of !mem_cgroup_supports_protection() ||
> > mem_cgroup_ignore_protection().
> >
> > > Also, I'd swap target and memcg arguments.
> >
> > Sounds good.
>
> I just remembered, the reason I put "target" first is to match the
> ordering of mem_cgroup_calculate_protection(), otherwise the code in
> shrink_node_memcgs() may be confusing.

Oh, I see... Never mind, let's leave it the way it is now.
Thanks for checking it out!

Roman