Date: Mon, 28 Mar 2022 16:44:45 -0700
From: Roman Gushchin
To: Waiman Long
Cc: Muchun Song, Andrew Morton, Linux Memory Management List, LKML
Subject: Re: [PATCH-mm v3] mm/list_lru: Optimize memcg_reparent_list_lru_node()
References: <20220309144000.1470138-1-longman@redhat.com>
 <2263666d-5eef-b1fe-d5e3-b166a3185263@redhat.com>
 <5aa687c4-2888-7977-8c1a-d51384e685aa@redhat.com>
 <9e184cff-263a-d83a-0fc9-0ac7d453aa2a@redhat.com>
In-Reply-To: <9e184cff-263a-d83a-0fc9-0ac7d453aa2a@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 28, 2022 at 05:20:16PM -0400, Waiman Long wrote:
> On 3/28/22 17:12, Roman Gushchin wrote:
> > On Mon, Mar 28, 2022 at 04:46:39PM -0400, Waiman Long wrote:
> > > On 3/28/22 15:12, Roman Gushchin wrote:
> > > > On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> > > > > On 3/22/22 22:12, Muchun Song wrote:
> > > > > > On Wed, Mar 23, 2022 at 9:55 AM Waiman Long wrote:
> > > > > > > On 3/22/22 21:06, Muchun Song wrote:
> > > > > > > > On Wed, Mar 9, 2022 at 10:40 PM Waiman Long wrote:
> > > > > > > > > Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > > > > > > > > to be race free"), we are tracking the total number of lru
> > > > > > > > > entries in a list_lru_node in its nr_items field. In the case of
> > > > > > > > > memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > > > > > > > > is 0. We don't even need to take the nlru->lock as no new lru entry
> > > > > > > > > could be added by a racing list_lru_add() to the draining src_idx memcg
> > > > > > > > > at this point.
> > > > > > > >
> > > > > > > > Hi Waiman,
> > > > > > > >
> > > > > > > > Sorry for the late reply. Quick question: what if there is an inflight
> > > > > > > > list_lru_add()? How about the following race?
> > > > > > > >
> > > > > > > > CPU0:                           CPU1:
> > > > > > > > list_lru_add()
> > > > > > > >     spin_lock(&nlru->lock)
> > > > > > > >     l = list_lru_from_kmem(memcg)
> > > > > > > >                                 memcg_reparent_objcgs(memcg)
> > > > > > > >                                 memcg_reparent_list_lrus(memcg)
> > > > > > > >                                   memcg_reparent_list_lru()
> > > > > > > >                                     memcg_reparent_list_lru_node()
> > > > > > > >                                       if (!READ_ONCE(nlru->nr_items))
> > > > > > > >                                           // Miss reparenting
> > > > > > > >                                           return
> > > > > > > >     // Assume 0->1
> > > > > > > >     l->nr_items++
> > > > > > > >     // Assume 0->1
> > > > > > > >     nlru->nr_items++
> > > > > > > >
> > > > > > > > IIUC, we use nlru->lock to serialise this scenario.
> > > > > > >
> > > > > > > I guess this race is theoretically possible but very unlikely since it
> > > > > > > means a very long pause between list_lru_from_kmem() and the increment
> > > > > > > of nr_items.
> > > > > >
> > > > > > It is more possible in a VM.
> > > > > >
> > > > > > > How about the following changes to make sure that this race can't happen?
> > > > > > >
> > > > > > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > > > > > index c669d87001a6..c31a0a8ad4e7 100644
> > > > > > > --- a/mm/list_lru.c
> > > > > > > +++ b/mm/list_lru.c
> > > > > > > @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
> > > > > > >          struct list_lru_one *src, *dst;
> > > > > > >
> > > > > > >          /*
> > > > > > > -         * If there is no lru entry in this nlru, we can skip it immediately.
> > > > > > > +         * If there is no lru entry in this nlru and the nlru->lock is free,
> > > > > > > +         * we can skip it immediately.
> > > > > > >           */
> > > > > > > -        if (!READ_ONCE(nlru->nr_items))
> > > > > > > +        if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> > > > > >
> > > > > > I think we also should insert a smp_rmb() between those two loads.
> > > > >
> > > > > Thinking about this some more, I believe that adding the spin_is_locked()
> > > > > check will be enough for x86. However, that will likely not be enough for
> > > > > arches with more relaxed memory semantics. So the safest way to avoid this
> > > > > possible race is to move the check to within the lock critical section,
> > > > > though that comes with a slightly higher overhead for the 0 nr_items case.
> > > > > I will send out a patch to correct that. Thanks for bringing this possible
> > > > > race to my attention.
> > > >
> > > > Yes, I think it's not enough:
> > > >
> > > > CPU0                                CPU1
> > > > READ_ONCE(&nlru->nr_items) -> 0
> > > >                                     spin_lock(&nlru->lock);
> > > >                                     nlru->nr_items++;
> > > >                                     spin_unlock(&nlru->lock);
> > > > && !spin_is_locked(&nlru->lock) -> 0
> > >
> > > I have actually thought of that. I am even thinking about reading nr_items
> > > again after spin_is_locked(). Still, for arches with relaxed memory
> > > semantics, when a memory write by one cpu will be propagated to another cpu
> > > can be highly variable. It is very hard to prove that it is completely safe.
> > >
> > > x86 has stricter memory semantics and it is the only architecture where I
> > > have enough confidence that doing the check without taking a lock can be
> > > safe. Perhaps we could use this optimization just for x86 and do it inside
> > > locks for the rest.
> >
> > Hm, is this such a big problem in real life? Can you describe the setup?
> > I'm somewhat resistant to the idea of having arch-specific optimizations
> > here without a HUGE reason.
>
> I am just throwing this idea out for discussion. It does not mean that I
> want to do an arch-specific patch unless there is performance data to
> indicate a substantial gain in performance in some use cases.

Got it!
I mean, it's not obvious from the original commit message whether this is just a nice optimization or whether it fixes a real-life problem. In the former case the best thing is probably to leave everything as it is now; in the latter we need to do something. I'm trying to understand which case it is. Thanks!