Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp2589172rwd; Sun, 28 May 2023 19:46:27 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ77Ns5gR4eNBOXIcpfUFi71ZT0vTaAQSarb6toF9+Ac1g8ekfDOlXYRlB86s9eziwTB7KzQ X-Received: by 2002:a05:6a20:6a0f:b0:10d:951f:58ba with SMTP id p15-20020a056a206a0f00b0010d951f58bamr8070361pzk.52.1685328387202; Sun, 28 May 2023 19:46:27 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1685328387; cv=none; d=google.com; s=arc-20160816; b=amYaUo/EoN9E9SUlmXX58kjfWSv8BNQSedQqZ0J3qgaimDdLz6NbX2KT45IpOoYugy IPA//iZ2ZwKN0hzkyjm7ukZbQWf8Mj8pzUEoGIZiHH8Yir9HE7P5Abj64NV62CByDV3R m5DIN5kDewYpEThTSwox2dTHYq94j87MxRCpby1WsoCWwHYmOZhet4rcJJDOQGdAZeZ0 f/w3Kg1n/e9Wt3sIKX98eAbCqgRZ06N+uC/W3sbMj/xrx5c0mQaTuyqqyaxMiIMLZ3OW lPVd6ZMwJuv8+nE+MW7uoaW/mrUXfWaTVBiQXl1T7S8zc26XyzTyKtczgigJxcHT0GZk Ft5g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:mime-version:date :dkim-signature:message-id; bh=qPg6ZDtxNrVeEaFU+D4Wu5Yq1tkWvN5fR6aSGPARBZA=; b=pBiFaWHTshF39VR2d0vn9tvK3DOG6euUorVJariAKjt2uK92YtKrjct+mUSyD0967X O9bmukh1x0KIb2VreBDRMiU4D3I06S/Phi93eQoLlsq7Zw4Qe8JeLMxKWqDP1EAbSIBB 2tcyUfp6uJxOQoK1gujYvbiOJTXfUCq+ogwJZ5W333ZioLgbCUmZQcuqKgEHz6MMmwR5 isK3tDc+acXJx2+xTIU9Uv+I5XrtpBCZCia+O3Vmrbvx0DC6kBhW2qTfR4vEU8BBgFWa SzhmqOM7XonGhiuUoypoceI8BAeUH4Qh7xwzJzt8X9Et3b/zZ4CU5y+xdtu2M8A9Ju/X +Lrw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=Z8XIoICr; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id b9-20020a63d309000000b005347ef6ec13si3963733pgg.507.2023.05.28.19.46.14; Sun, 28 May 2023 19:46:27 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=Z8XIoICr; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229629AbjE2Cjh (ORCPT + 99 others); Sun, 28 May 2023 22:39:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53886 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229512AbjE2Cjf (ORCPT ); Sun, 28 May 2023 22:39:35 -0400 Received: from out-25.mta0.migadu.com (out-25.mta0.migadu.com [IPv6:2001:41d0:1004:224b::19]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7DAA9A9 for ; Sun, 28 May 2023 19:39:33 -0700 (PDT) Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1685327971; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qPg6ZDtxNrVeEaFU+D4Wu5Yq1tkWvN5fR6aSGPARBZA=; b=Z8XIoICremlWHRFpMcWCOQAANOafwHhpcFj6yo+H86RlcpqprVyJ7IZ1w9eblm/lqZG+VB /lKgMQR/5qeYI0UUFPj0ZoRYAIS3cqLFLIAu0/pMs9s5v79n6vavfYbJgZYR2CkffPlstJ acLcdsTWSl0MD+s4mY6Zn9Xbi5cQ4yQ= Date: Mon, 29 May 2023 10:39:21 +0800 MIME-Version: 1.0 Subject: Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression Content-Language: en-US To: paulmck@kernel.org, Kirill Tkhai Cc: RCU , Yujie Liu , oe-lkp@lists.linux.dev, lkp@intel.com, linux-kernel@vger.kernel.org, Andrew Morton , Vlastimil Babka , Roman Gushchin , =?UTF-8?Q?Christian_K=c3=b6nig?= , David Hildenbrand , Davidlohr Bueso , Johannes Weiner , Michal Hocko , Muchun Song , Shakeel Butt , Yang Shi , linux-mm@kvack.org, ying.huang@intel.com, feng.tang@intel.com, fengwei.yin@intel.com References: <202305230837.db2c233f-yujie.liu@intel.com> <896bbb09-d400-ec73-ba3a-b64c6e9bbe46@linux.dev> <44407892-b7bc-4d6c-8e4a-6452f0ee88b9@paulmck-laptop> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Qi Zheng In-Reply-To: <44407892-b7bc-4d6c-8e4a-6452f0ee88b9@paulmck-laptop> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Paul, On 2023/5/27 19:14, Paul E. McKenney wrote: > On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote: >> On 2023/5/24 19:56, Qi Zheng wrote: >>> On 2023/5/24 19:08, Qi Zheng wrote: >>> >>> [...] >>> >>>> >>>> Well, I just ran the following command and reproduced the result: >>>> >>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & >>>> >>>> 1) with commit 42c9db3970483: >>>> >>>> stress-ng: info:  [11023] setting to a 60 second run per stressor >>>> stress-ng: info:  [11023] dispatching hogs: 9 ramfs >>>> stress-ng: info:  [11023] stressor       bogo ops real time  usr >>>> time sys time   bogo ops/s     bogo ops/s >>>> stress-ng: info:  [11023]                           (secs)    (secs) >>>> (secs)   (real time) (usr+sys time) >>>> stress-ng: info:  [11023] ramfs            774966     60.00 >>>> 10.18 169.45     12915.89        4314.26 >>>> stress-ng: info:  [11023] for a 60.00s run time: >>>> stress-ng: info:  [11023]    1920.11s available CPU time >>>> stress-ng: info:  [11023]      10.18s user time   (  0.53%) >>>> stress-ng: info:  [11023]     169.44s system time (  8.82%) >>>> stress-ng: info:  [11023]     179.62s total time  (  9.35%) >>>> stress-ng: info:  [11023] load average: 8.99 2.69 0.93 >>>> stress-ng: info:  [11023] successful run completed in 60.00s (1 min, >>>> 0.00 secs) >>>> >>>> 2) with commit f95bdb700bc6b: >>>> >>>> stress-ng: info:  [37676] dispatching hogs: 9 ramfs >>>> stress-ng: info:  [37676] stressor       bogo ops real time  usr >>>> time sys time   bogo ops/s     bogo ops/s >>>> stress-ng: info:  [37676]                           (secs)    (secs) >>>> (secs)   (real time) (usr+sys time) >>>> stress-ng: info:  [37676] ramfs            168673     60.00 >>>> 1.61   39.66      2811.08        4087.47 >>>> stress-ng: info:  [37676] for a 60.10s run time: >>>> stress-ng: info:  [37676]    1923.36s available CPU time >>>> stress-ng: info:  [37676]       1.60s user time   (  0.08%) >>>> stress-ng: info:  [37676]      39.66s system time (  2.06%) >>>> stress-ng: info:  [37676]      41.26s total time  (  2.15%) >>>> stress-ng: info:  [37676] load average: 7.69 3.63 2.36 >>>> stress-ng: info:  [37676] successful run completed in 60.10s (1 min, >>>> 0.10 secs) >>>> >>>> The bogo ops/s (real time) did drop significantly. >>>> >>>> And the memory reclaimation was not triggered in the whole process. so >>>> theoretically no one is in the read critical section of shrinker_srcu. >>>> >>>> Then I found that some stress-ng-ramfs processes were in >>>> TASK_UNINTERRUPTIBLE state for a long time: >>>> >>>> root       42313  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42314  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42315  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42316  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42317  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> root       42318  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42319  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> root       42320  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42321  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> root       42322  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42323  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> root       42324  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42325  7.8  0.0  69592  1812 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> root       42326  0.0  0.0  69592  2068 pts/0    S    19:00   0:00 >>>> stress-ng-ramfs [run] >>>> root       42327  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> root       42328  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> root       42329  7.9  0.0  69592  1812 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> root       42330  7.9  0.0  69592  1556 pts/0    D    19:00   0:02 >>>> stress-ng-ramfs [run] >>>> >>>> Their call stack is as follows: >>>> >>>> cat /proc/42330/stack >>>> >>>> [<0>] __synchronize_srcu.part.21+0x83/0xb0 >>>> [<0>] unregister_shrinker+0x85/0xb0 >>>> [<0>] deactivate_locked_super+0x27/0x70 >>>> [<0>] cleanup_mnt+0xb8/0x140 >>>> [<0>] task_work_run+0x65/0x90 >>>> [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0 >>>> [<0>] syscall_exit_to_user_mode+0x1b/0x40 >>>> [<0>] do_syscall_64+0x44/0x80 >>>> [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd >>>> >>>> + RCU folks, Is this result as expected? I would have thought that >>>> synchronize_srcu() should return quickly if no one is in the read >>>> critical section. :( > > In theory, it would indeed be nice if synchronize_srcu() would do that. > In practice, the act of checking to see if there is anyone in an SRCU > read-side critical section is a heavy-weight operation, involving at > least one cache miss per CPU along with a number of full memory barriers. > > So SRCU has to be careful to not check too frequently. Got it. > > However, if SRCU has been idle for some time, normal synchronize_srcu() > will do an immediate check. And this will of course mark SRCU as > non-idle. > >>> With the following changes, ops/s can return to previous levels: >> >> Or just set rcu_expedited to 1: >> echo 1 > /sys/kernel/rcu_expedited > > This does cause SRCU to be much more aggressive. This can be a good > choice for small systems, but please keep in mind that this affects normal > RCU as well as SRCU. It will cause RCU to also be much more aggressive, > sending IPIs to CPUs that are (or might be) in RCU read-side critical > sections. Depending on your workload, this might or might not be what > you want RCU to be doing. For example, if you are running aggressive > real-time workloads, it most definitely is not what you want. Yeah, that's not what I want, a shrinker might run for a long time. > >>> diff --git a/mm/vmscan.c b/mm/vmscan.c >>> index db2ed6e08f67..90f541b07cd1 100644 >>> --- a/mm/vmscan.c >>> +++ b/mm/vmscan.c >>> @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker) >>>         debugfs_entry = shrinker_debugfs_remove(shrinker); >>>         up_write(&shrinker_rwsem); >>> >>> -       synchronize_srcu(&shrinker_srcu); >>> +       synchronize_srcu_expedited(&shrinker_srcu); > > If shrinkers are unregistered only occasionally, this is an entirely > reasonable change. > >>>         debugfs_remove_recursive(debugfs_entry); >>> >>> stress-ng: info:  [13159] dispatching hogs: 9 ramfs >>> stress-ng: info:  [13159] stressor       bogo ops real time  usr time >>> sys time   bogo ops/s     bogo ops/s >>> stress-ng: info:  [13159]                           (secs)    (secs) >>> (secs)   (real time) (usr+sys time) >>> stress-ng: info:  [13159] ramfs            710062     60.00      9.63 >>> 157.26     11834.18        4254.75 >>> stress-ng: info:  [13159] for a 60.00s run time: >>> stress-ng: info:  [13159]    1920.14s available CPU time >>> stress-ng: info:  [13159]       9.62s user time   (  0.50%) >>> stress-ng: info:  [13159]     157.26s system time (  8.19%) >>> stress-ng: info:  [13159]     166.88s total time  (  8.69%) >>> stress-ng: info:  [13159] load average: 9.49 4.02 1.65 >>> stress-ng: info:  [13159] successful run completed in 60.00s (1 min, >>> 0.00 secs) >>> >>> Can we make synchronize_srcu() call synchronize_srcu_expedited() when no >>> one is in the read critical section? > > Yes, in theory we could, but this would be a bad thing in practice. > After all, the point of having synchronize_srcu() be separate from > synchronize_srcu_expedited() is to allow uses that are OK with longer > latency avoid consuming too much CPU. In addition, that longer > SRCU grace-period latency allows the next grace period to handle more > synchronize_srcu() and call_srcu() requests. This amortizes the > overhead of that next grace period over a larger number of updates. > > However, your use of synchronize_srcu_expedited() does have that effect, > but only for this call point. Which has the advantage of avoiding > burning excessive quantities of CPU for the other 50+ call points. Thanks for such a detailed explanation. Now I think we can continue to try to complete the idea[1] from Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed work, so it doesn't affect on user-visible unregistration speed. [1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/ Thanks, Qi > > Thanx, Paul