Date: Sat, 4 Jun 2022 11:58:57 +0800
From: Ming Lei
To: Waiman Long
Cc: Tejun Heo, Jens Axboe, cgroups@vger.kernel.org, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, ming.lei@redhat.com
Subject: Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
References: <20220602192020.166940-1-longman@redhat.com>
 <20220602192020.166940-4-longman@redhat.com>
In-Reply-To: <20220602192020.166940-4-longman@redhat.com>

Hi Waiman,

On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long wrote:
> For a system with many CPUs and block devices, the time to do
> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
> can be especially problematic as interrupts are disabled during the flush.
> It was reported that it might take seconds to complete in some extreme
> cases, leading to hard lockup messages.
>
> As it is likely that not all the percpu blkg_iostat_set's have been
> updated since the last flush, those stale blkg_iostat_set's don't need
> to be flushed in this case. This patch optimizes blkcg_rstat_flush()
> by keeping a lockless list of recently updated blkg_iostat_set's in a
> newly added percpu blkcg->lhead pointer.
>
> The blkg_iostat_set is added to the lockless list on the update side
> in blk_cgroup_bio_start(). It is removed from the lockless list when
> flushed in blkcg_rstat_flush(). Due to racing, it is possible that
> blkg_iostat_set's in the lockless list may have no new IO stats to be
> flushed. To protect against destruction of blkg, a percpu reference is
> gotten when putting into the lockless list and put back when removed.
>
> A blkg_iostat_set can determine if it is in a lockless list by checking
> the content of its lnode.next pointer, which will be non-NULL when in
> a lockless list. This requires the presence of a special llist_last
> sentinel node to be put at the end of the lockless list.
>
> When booting up an instrumented test kernel with this patch on a
> 2-socket 96-thread system with cgroup v2, out of the 2051 calls to
> cgroup_rstat_flush() after bootup, 1788 of the calls were exited
> immediately because of an empty lockless list. After an all-CPU kernel
> build, the ratio became 6295424/6340513. That was more than 99%.
>
> Signed-off-by: Waiman Long
> Acked-by: Tejun Heo
> ---
>  block/blk-cgroup.c | 100 ++++++++++++++++++++++++++++++++++++++++++---
>  block/blk-cgroup.h |   9 ++++
>  2 files changed, 103 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 9021f75fc752..963a779c4cab 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -59,6 +59,71 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
>
>  #define BLKG_DESTROY_BATCH_SIZE 64
>
> +/*
> + * Lockless lists for tracking IO stats update
> + *
> + * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
> + * There are multiple blkg's (one for each block device) attached to each
> + * blkcg. The rstat code keeps track of which cpu has IO stats updated,
> + * but it doesn't know which blkg has the updated stats. If there are many
> + * block devices in a system, the cost of iterating all the blkg's to flush
> + * out the IO stats can be high. To reduce such overhead, a set of percpu
> + * lockless lists (lhead) per blkcg are used to track the set of recently
> + * updated iostat_cpu's since the last flush. An iostat_cpu will be put
> + * onto the lockless list on the update side [blk_cgroup_bio_start()] if
> + * not there yet and then removed when being flushed [blkcg_rstat_flush()].
> + * References to blkg are gotten and then put back in the process to
> + * protect against blkg removal.
> + *
> + * lnode.next of the last entry in a lockless list is NULL.
> + * To enable us to use lnode.next as a boolean flag to indicate its
> + * presence in a lockless list, we have to make it non-NULL for all.
> + * This is done by using a sentinel node at the end of the lockless
> + * list. All the percpu lhead's are initialized to point to that
> + * sentinel node as being empty.
> + */
> +static struct llist_node llist_last;
> +
> +static bool blkcg_llist_empty(struct llist_head *lhead)
> +{
> +        return lhead->first == &llist_last;
> +}
> +
> +static void init_blkcg_llists(struct blkcg *blkcg)
> +{
> +        int cpu;
> +
> +        for_each_possible_cpu(cpu)
> +                per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
> +}
> +
> +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
> +{
> +        return xchg(&lhead->first, &llist_last);
> +}
> +
> +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
> +{
> +        struct llist_node *next = READ_ONCE(lnode->next);
> +        struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
> +                                            lnode)->blkg;
> +
> +        WRITE_ONCE(lnode->next, NULL);
> +        percpu_ref_put(&blkg->refcnt);
> +        return next;
> +}
> +
> +/*
> + * The retrieved blkg_iostat_set is immediately marked as not in the
> + * lockless list by clearing its node->next pointer. It could be put
> + * back into the list by a parallel update before the iostat's are
> + * finally flushed including probably the new update.
> + */
> +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)                 \
> +        for (; (node != &llist_last) &&                                 \
> +               (pos = llist_entry(node, struct blkg_iostat_set, lnode), \
> +                nxt = fetch_delete_lnode_next(node), true);             \
> +                node = nxt)
> +
>  /**
>   * blkcg_css - find the current css
>   *
> @@ -236,8 +301,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>          blkg->blkcg = blkcg;
>
>          u64_stats_init(&blkg->iostat.sync);
> -        for_each_possible_cpu(cpu)
> +        for_each_possible_cpu(cpu) {
>                  u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
> +                per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
> +        }
>
>          for (i = 0; i < BLKCG_MAX_POLS; i++) {
>                  struct blkcg_policy *pol = blkcg_policy[i];
> @@ -852,17 +919,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
>  static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>  {
>          struct blkcg *blkcg = css_to_blkcg(css);
> -        struct blkcg_gq *blkg;
> +        struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> +        struct llist_node *lnode, *lnext;
> +        struct blkg_iostat_set *bisc;
>
>          /* Root-level stats are sourced from system-wide IO stats */
>          if (!cgroup_parent(css->cgroup))
>                  return;
>
> +        if (blkcg_llist_empty(lhead))
> +                return;
> +
>          rcu_read_lock();
>
> -        hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> +        lnode = fetch_delete_blkcg_llist(lhead);
> +        blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
> +                struct blkcg_gq *blkg = bisc->blkg;
>                  struct blkcg_gq *parent = blkg->parent;
> -                struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
>                  struct blkg_iostat cur, delta;
>                  unsigned long flags;
>                  unsigned int seq;
> @@ -1170,6 +1243,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
>
>          mutex_unlock(&blkcg_pol_mutex);
>
> +        free_percpu(blkcg->lhead);
>          kfree(blkcg);
>  }
>
> @@ -1189,6 +1263,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
>                  goto unlock;
>          }
>
> +        blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
> +        if (!blkcg->lhead)
> +                goto free_blkcg;
> +        init_blkcg_llists(blkcg);
> +
>          for (i = 0; i < BLKCG_MAX_POLS ; i++) {
>                  struct blkcg_policy *pol = blkcg_policy[i];
>                  struct blkcg_policy_data *cpd;
> @@ -1229,7 +1308,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
>          for (i--; i >= 0; i--)
>                  if (blkcg->cpd[i])
>                          blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
> -
> +        free_percpu(blkcg->lhead);
> +free_blkcg:
>          if (blkcg != &blkcg_root)
>                  kfree(blkcg);
>  unlock:
> @@ -1993,6 +2073,7 @@ static int blk_cgroup_io_type(struct bio *bio)
>
>  void blk_cgroup_bio_start(struct bio *bio)
>  {
> +        struct blkcg *blkcg = bio->bi_blkg->blkcg;
>          int rwd = blk_cgroup_io_type(bio), cpu;
>          struct blkg_iostat_set *bis;
>          unsigned long flags;
> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>          }
>          bis->cur.ios[rwd]++;
>
> +        if (!READ_ONCE(bis->lnode.next)) {
> +                struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> +
> +                llist_add(&bis->lnode, lhead);
> +                percpu_ref_get(&bis->blkg->refcnt);
> +        }

The above still adds cost in the fast IO path.

> +
>          u64_stats_update_end_irqrestore(&bis->sync, flags);
>          if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> -                cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> +                cgroup_rstat_updated(blkcg->css.cgroup, cpu);
>          put_cpu();
>  }

IMO, this seems to be a generic cgroup issue. More importantly, the percpu
lock cgroup_rstat_cpu_lock is already held in both cgroup_rstat_updated()
and cgroup_rstat_flush_locked(), which can provide enough synchronization
at zero extra cost, and other cgroup controllers could benefit from this
kind of much simpler improvement too. So what do you think of the following
approach? BTW, the cpumask could be replaced with a plain percpu variable
to avoid cache-line contention.

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 23ec30f50cca..f8287fced726 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -858,6 +858,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
         if (!cgroup_parent(css->cgroup))
                 return;
 
+        if (!cpumask_test_cpu(cpu, blkcg->iostat_cpumask))
+                return;
+
+        cpumask_clear_cpu(cpu, blkcg->iostat_cpumask);
+
         rcu_read_lock();
 
         hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
@@ -1170,6 +1175,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
 
         mutex_unlock(&blkcg_pol_mutex);
 
+        free_cpumask_var(blkcg->iostat_cpumask);
         kfree(blkcg);
 }
 
@@ -1213,6 +1219,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
                 pol->cpd_init_fn(cpd);
         }
 
+        if (!zalloc_cpumask_var(&blkcg->iostat_cpumask, GFP_KERNEL))
+                goto free_pd_blkcg;
+
         spin_lock_init(&blkcg->lock);
         refcount_set(&blkcg->online_pin, 1);
         INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
@@ -2009,7 +2018,8 @@ void blk_cgroup_bio_start(struct bio *bio)
 
         u64_stats_update_end_irqrestore(&bis->sync, flags);
         if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-                cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+                cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu,
+                                     bio->bi_blkg->blkcg->iostat_cpumask);
         put_cpu();
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..458b40ca045a 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -103,6 +103,7 @@ struct blkcg {
 #ifdef CONFIG_CGROUP_WRITEBACK
         struct list_head        cgwb_list;
 #endif
+        cpumask_var_t           iostat_cpumask;
 };
 
 static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0d1ada8968d7..4fa5dde3a62c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -763,7 +763,7 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
 /*
  * cgroup scalable recursive statistics.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
+void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask);
 void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 24b5c2ab5598..f4eb63b86e56 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -22,7 +22,7 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
  * rstat_cpu->updated_children list. See the comment on top of
  * cgroup_rstat_cpu definition for details.
  */
-void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
+void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask)
 {
         raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
         unsigned long flags;
@@ -40,6 +40,9 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
 
         raw_spin_lock_irqsave(cpu_lock, flags);
 
+        if (cpumask)
+                cpumask_set_cpu(cpu, cpumask);
+
         /* put @cgrp and all ancestors on the corresponding updated lists */
         while (true) {
                 struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
@@ -366,7 +369,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
                                                  unsigned long flags)
 {
         u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
-        cgroup_rstat_updated(cgrp, smp_processor_id());
+        cgroup_rstat_updated(cgrp, smp_processor_id(), NULL);
         put_cpu_ptr(rstatc);
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe6..8c4f204dbf5b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
         unsigned int x;
 
-        cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+        cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
 
         x = __this_cpu_add_return(stats_updates, abs(val));
         if (x > MEMCG_CHARGE_BATCH) {

Thanks,
Ming
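
P.S. Just to make the "plain percpu variable" idea above a bit more concrete,
here is a rough and completely untested sketch; the iostat_dirty field and the
two helpers are made-up names for illustration, only the percpu accessors are
existing interfaces:

/* hypothetical: one "dirty" flag per CPU per blkcg instead of a shared cpumask */
struct blkcg {
        /* ... existing fields ... */
        bool __percpu           *iostat_dirty;  /* alloc_percpu(bool) in blkcg_css_alloc() */
};

/* update side: runs with preemption disabled in blk_cgroup_bio_start() */
static inline void blkcg_mark_iostat_dirty(struct blkcg *blkcg)
{
        this_cpu_write(*blkcg->iostat_dirty, true);
}

/* flush side: called from blkcg_rstat_flush(css, cpu) before walking the blkg's */
static inline bool blkcg_test_clear_iostat_dirty(struct blkcg *blkcg, int cpu)
{
        bool *dirty = per_cpu_ptr(blkcg->iostat_dirty, cpu);

        if (!READ_ONCE(*dirty))
                return false;   /* no IO stats updated on this CPU since last flush */
        WRITE_ONCE(*dirty, false);
        return true;
}

blkcg_rstat_flush() would then return early when blkcg_test_clear_iostat_dirty()
returns false, same effect as the cpumask check in the diff above, but the update
side only ever touches a cache line local to its own CPU.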