Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: paulmck@kernel.org, Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka, akpm@linux-foundation.org, tkhai@ya.ru,
    roman.gushchin@linux.dev, djwong@kernel.org, brauner@kernel.org,
    tytso@mit.edu, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
    linux-arm-msm@vger.kernel.org, dm-devel@redhat.com,
    linux-raid@vger.kernel.org, linux-bcache@vger.kernel.org,
    virtualization@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
    linux-ext4@vger.kernel.org, linux-nfs@vger.kernel.org,
    linux-xfs@vger.kernel.org, linux-btrfs@vger.kernel.org
Date: Wed, 5 Jul 2023 11:27:28 +0800
Message-ID: <733af312-fb2d-3ec4-54c8-f154447c2051@bytedance.com>
References: <20230622085335.77010-1-zhengqi.arch@bytedance.com>
    <20230622085335.77010-25-zhengqi.arch@bytedance.com>
    <3efa68e0-b04f-5c11-4fe2-2db0784064fc@bytedance.com>
In-Reply-To: <3efa68e0-b04f-5c11-4fe2-2db0784064fc@bytedance.com>
X-Mailing-List: linux-nfs@vger.kernel.org

On 2023/7/4 11:45, Qi Zheng wrote:
>
>
> On 2023/7/4 00:39, Paul E. McKenney wrote:
>> On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>> -	if (!down_read_trylock(&shrinker_rwsem))
>>>>> -		goto out;
>>>>> -
>>>>> -	list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>> +	rcu_read_lock();
>>>>> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>>  		struct shrink_control sc = {
>>>>>  			.gfp_mask = gfp_mask,
>>>>>  			.nid = nid,
>>>>>  			.memcg = memcg,
>>>>>  		};
>>>>> +		if (!shrinker_try_get(shrinker))
>>>>> +			continue;
>>>>> +		rcu_read_unlock();
>>>>
>>>> I don't think you can do this unlock?
>>
>> Sorry to be slow to respond here, this one fell through the cracks.
>> And thank you to Qi for reminding me!
>>
>> If you do this unlock, you had jolly well better nail down the current
>> element (the one referenced by shrinker), for example, by acquiring an
>> explicit reference count on the object.  And presumably this is exactly
>> what shrinker_try_get() is doing.  And a look at your 24/29 confirms
>> this, at least assuming that shrinker->refcount is set to zero before
>> the call to synchronize_rcu() in free_module() *and* that
>> synchronize_rcu() doesn't start until *after* shrinker_put() calls
>> complete().  Plus, as always, the object must be removed from the list
>> before the synchronize_rcu() starts.  (On these parts of the puzzle, I
>> defer to those more familiar with this code path.  And I strongly
>> suggest carefully commenting this type of action-at-a-distance design
>> pattern.)
>
> Yeah, I think I've done it like above. A more detailed timing diagram is
> below.
>
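
(For anyone following along: the get/put pair under discussion looks
roughly like the following. This is a simplified sketch of patch 24/29
reconstructed from the description in this thread; the refcount and the
completion match what is discussed above, the exact field names are
illustrative, and everything unrelated is elided.)

/* Sketch only: simplified get/put pair, per the discussion above. */
#include <linux/completion.h>
#include <linux/list.h>
#include <linux/refcount.h>

struct shrinker {
	refcount_t refcount;		/* initial ref taken at registration */
	struct completion done;		/* completed when refcount hits zero */
	struct list_head list;		/* entry in shrinker_list (RCU) */
	/* ... shrink callbacks etc. elided ... */
};

static bool shrinker_try_get(struct shrinker *shrinker)
{
	/* Fails once unregistration has dropped the initial reference. */
	return refcount_inc_not_zero(&shrinker->refcount);
}

static void shrinker_put(struct shrinker *shrinker)
{
	if (refcount_dec_and_test(&shrinker->refcount))
		complete(&shrinker->done);	/* unblock unregister_shrinker() */
}
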
>> Why is this important?  Because otherwise that object might be freed
>> before you get to the call to rcu_read_lock() at the end of this loop.
>> And if that happens, list_for_each_entry_rcu() will be walking the
>> freelist, which is quite bad for the health and well-being of your
>> kernel.
>>
>> There are a few other ways to make this sort of thing work:
>>
>> 1.	Defer the shrinker_put() to the beginning of the loop.
>> 	You would need a flag initially set to zero, and then set to
>> 	one just before (or just after) the rcu_read_lock() above.
>> 	You would also need another shrinker_old pointer to track the
>> 	old pointer.  Then at the top of the loop, if the flag is set,
>> 	invoke shrinker_put() on shrinker_old.  This ensures that the
>> 	previous shrinker structure stays around long enough to allow
>> 	the loop to find the next shrinker structure in the list.
>>
>> 	This approach is attractive when the removal code path
>> 	can invoke shrinker_put() after the grace period ends.
>>
>> 2.	Make shrinker_put() invoke call_rcu() when ->refcount reaches
>> 	zero, and have the callback function free the object.  This of
>> 	course requires adding an rcu_head structure to the shrinker
>> 	structure, which might or might not be a reasonable course of
>> 	action.  If adding that rcu_head is reasonable, this simplifies
>> 	the logic quite a bit.
>>
>> 3.	For the shrinker-structure-removal code path, remove the shrinker
>> 	structure, then remove the initial count from ->refcount,
>> 	and then keep doing grace periods until ->refcount is zero,
>> 	then do one more.  Of course, if the result of removing the
>> 	initial count was zero, then only a single additional grace
>> 	period is required.
>>
>> 	This would need to be carefully commented, as it is a bit
>> 	unconventional.
>
> Thanks for such a detailed addition!
>
>> There are probably many other ways, but just to give an idea of a few
>> other ways to do this.
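
To make option 2 above concrete, here is a minimal sketch. It assumes an
rcu_head member is added to struct shrinker, as Paul notes, and reuses
the refcount from the sketch earlier in this mail; names are
illustrative:

/* Sketch only: option 2, free via call_rcu() on the last put. */
#include <linux/rcupdate.h>
#include <linux/slab.h>

static void shrinker_free_rcu_cb(struct rcu_head *head)
{
	/* A grace period has elapsed; no lockless walker can still see us. */
	kfree(container_of(head, struct shrinker, rcu));
}

static void shrinker_put(struct shrinker *shrinker)
{
	if (refcount_dec_and_test(&shrinker->refcount))
		call_rcu(&shrinker->rcu, shrinker_free_rcu_cb);
}

With this variant the unregister path does not have to block until the
count reaches zero; the final shrinker_put(), wherever it happens,
schedules the free after a grace period.
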
>>>>> +
>>>>>  		ret = do_shrink_slab(&sc, shrinker, priority);
>>>>>  		if (ret == SHRINK_EMPTY)
>>>>>  			ret = 0;
>>>>>  		freed += ret;
>>>>> -		/*
>>>>> -		 * Bail out if someone want to register a new shrinker to
>>>>> -		 * prevent the registration from being stalled for long periods
>>>>> -		 * by parallel ongoing shrinking.
>>>>> -		 */
>>>>> -		if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>> -			freed = freed ? : 1;
>>>>> -			break;
>>>>> -		}
>>>>> -	}
>>>>> -	up_read(&shrinker_rwsem);
>>>>> -out:
>>>>> +		rcu_read_lock();
>>>>
>>>> That new rcu_read_lock() won't help AFAIK, the whole
>>>> list_for_each_entry_rcu() needs to be under the single
>>>> rcu_read_lock() to be safe.
>>>
>>> Yeah, that's the pattern we've been taught and the one we can look
>>> at and immediately say "this is safe".
>>>
>>> This is a different pattern, as has been explained by Qi, and I
>>> think it *might* be safe.
>>>
>>> *However.*
>>>
>>> Right now I don't have time to go through a novel RCU list iteration
>>> pattern one step at a time to determine the correctness of the
>>> algorithm. I'm mostly worried about list manipulations that can
>>> occur outside the rcu_read_lock() section bleeding into the RCU
>>> critical section, because rcu_read_lock() by itself is not a memory
>>> barrier.
>>>
>>> Maybe Paul has seen this pattern often enough he could simply tell
>>> us what conditions it is safe in. But for me to work that out from
>>> first principles? I just don't have the time to do that right now.
>>
>> If the code does just the right sequence of things on the removal path
>> (remove, decrement reference, wait for reference to go to zero, wait for
>> grace period, free), then it would work.  If this is what is happening,
>> I would argue for more comments.  ;-)
>
> The order of the removal path is slightly different from this:
>
>      shrink_slab                  unregister_shrinker
>      ===========                  ===================
>
>      shrinker_try_get()
>      rcu_read_unlock()
>                                   1. decrement initial reference
>                                      shrinker_put()
>                                   2. wait for reference to go to zero
>                                      wait_for_completion()
>      rcu_read_lock()
>
>      shrinker_put()
>                                   3. remove the shrinker from list
>                                      list_del_rcu()
>                                   4. wait for grace period
>                                      kfree_rcu()/synchronize_rcu()
>
>      list_for_each_entry()
>
>      shrinker_try_get()
>      rcu_read_unlock()
>                                   5. free the shrinker
>
> So the order is: decrement reference, wait for reference to go to zero,
> remove, wait for grace period, free.
>
> I think this can work. And we can only do the *step 3* after we hold the
> RCU read lock again, right? Please let me know if I missed something.

Oh, you are right, it would be better to move step 3 to before step 1.
We should first remove the shrinker from the shrinker_list, to prevent
other traversers from finding it again; otherwise the following
situation can occur, at least in theory:

     CPU 0                        CPU 1

     shrinker_try_get()

                                  shrinker_try_get()

     shrinker_put()

     shrinker_try_get()

                                  shrinker_put()

As long as new traversers can keep finding the shrinker on the list and
taking references, the refcount may never reach zero, and the
wait_for_completion() in unregister_shrinker() could be delayed
indefinitely.

Thanks,
Qi

>
> Thanks,
> Qi
>
>>
>> 							Thanx, Paul
>>
>>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses
>>>> IDR.
>>>
>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>> with reference counted objects are a known safe pattern that we can
>>> easily confirm is correct or not.  Hence I suggested the unification
>>> + IDR route because it makes the life of reviewers so, so much
>>> easier...
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david@fromorbit.com
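
P.S. Expressed in code, the removal path with step 3 moved first would
look roughly like this (sketch only: writer-side locking around the
shrinker_list update is elided, and the rcu_head and completion fields
from the sketches earlier in this mail are assumed):

/* Sketch only: removal path with the list removal done first. */
void unregister_shrinker(struct shrinker *shrinker)
{
	/* 1. Unpublish first: lockless walkers can no longer find the
	 *    shrinker and take new references.
	 */
	list_del_rcu(&shrinker->list);

	/* 2. Drop the initial reference taken at registration. */
	shrinker_put(shrinker);

	/* 3. Wait for the last shrinker_try_get() holder to call
	 *    shrinker_put() and drive the refcount to zero.
	 */
	wait_for_completion(&shrinker->done);

	/* 4+5. Wait a grace period for walkers that found the shrinker
	 *      before the list_del_rcu() but failed shrinker_try_get(),
	 *      then free.
	 */
	kfree_rcu(shrinker, rcu);
}

The key point is step 1: once the shrinker is unpublished, the refcount
can only go down, so the wait in step 3 is guaranteed to terminate.
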