Date: Mon, 6 Jun 2022 09:39:02 +0800
From: Ming Lei
To: Waiman Long
Cc: Tejun Heo, Jens Axboe, cgroups@vger.kernel.org, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, ming.lei@redhat.com
Subject: Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
References: <20220602192020.166940-1-longman@redhat.com>
            <20220602192020.166940-4-longman@redhat.com>

On Sun, Jun 05, 2022 at 07:15:27PM -0400, Waiman Long wrote:
> On 6/3/22 23:58, Ming Lei wrote:
> > Hi Waiman,
> >
> > On Thu, Jun 02, 2022 at 03:20:20PM -0400, Waiman Long wrote:
> > > For a system with many CPUs and block devices, the time to do
> > > blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
> > > can be especially problematic as interrupt is disabled during the flush.
> > > It was reported that it might take seconds to complete in some extreme
> > > cases leading to hard lockup messages.
> > >
> > > As it is likely that not all the percpu blkg_iostat_set's has been
> > > updated since the last flush, those stale blkg_iostat_set's don't need
> > > to be flushed in this case. This patch optimizes blkcg_rstat_flush()
> > > by keeping a lockless list of recently updated blkg_iostat_set's in a
> > > newly added percpu blkcg->lhead pointer.
> > >
> > > The blkg_iostat_set is added to the lockless list on the update side
> > > in blk_cgroup_bio_start(). It is removed from the lockless list when
> > > flushed in blkcg_rstat_flush(). Due to racing, it is possible that
> > > blk_iostat_set's in the lockless list may have no new IO stats to be
> > > flushed. To protect against destruction of blkg, a percpu reference is
> > > gotten when putting into the lockless list and put back when removed.
> > >
> > > A blkg_iostat_set can determine if it is in a lockless list by checking
> > > the content of its lnode.next pointer which will be non-NULL when in
> > > a lockless list. This requires the presence of a special llist_last
> > > sentinel node to be put at the end of the lockless list.
> > >
> > > When booting up an instrumented test kernel with this patch on a
> > > 2-socket 96-thread system with cgroup v2, out of the 2051 calls to
> > > cgroup_rstat_flush() after bootup, 1788 of the calls were exited
> > > immediately because of empty lockless list. After an all-cpu kernel
> > > build, the ratio became 6295424/6340513. That was more than 99%.
> > >
> > > Signed-off-by: Waiman Long
> > > Acked-by: Tejun Heo
> > > ---
> > >  block/blk-cgroup.c | 100 ++++++++++++++++++++++++++++++++++++++++++---
> > >  block/blk-cgroup.h |   9 ++++
> > >  2 files changed, 103 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > > index 9021f75fc752..963a779c4cab 100644
> > > --- a/block/blk-cgroup.c
> > > +++ b/block/blk-cgroup.c
> > > @@ -59,6 +59,71 @@ static struct workqueue_struct *blkcg_punt_bio_wq;
> > >  #define BLKG_DESTROY_BATCH_SIZE  64
> > >
> > > +/*
> > > + * Lockless lists for tracking IO stats update
> > > + *
> > > + * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
> > > + * There are multiple blkg's (one for each block device) attached to each
> > > + * blkcg. The rstat code keeps track of which cpu has IO stats updated,
> > > + * but it doesn't know which blkg has the updated stats. If there are many
> > > + * block devices in a system, the cost of iterating all the blkg's to flush
> > > + * out the IO stats can be high. To reduce such overhead, a set of percpu
> > > + * lockless lists (lhead) per blkcg are used to track the set of recently
> > > + * updated iostat_cpu's since the last flush.
> > > + * An iostat_cpu will be put onto the lockless list on the update side
> > > + * [blk_cgroup_bio_start()] if not there yet and then removed when being
> > > + * flushed [blkcg_rstat_flush()]. References to blkg are gotten and then
> > > + * put back in the process to protect against blkg removal.
> > > + *
> > > + * lnode.next of the last entry in a lockless list is NULL. To enable us to
> > > + * use lnode.next as a boolean flag to indicate its presence in a lockless
> > > + * list, we have to make it non-NULL for all. This is done by using a
> > > + * sentinel node at the end of the lockless list. All the percpu lhead's
> > > + * are initialized to point to that sentinel node as being empty.
> > > + */
> > > +static struct llist_node llist_last;
> > > +
> > > +static bool blkcg_llist_empty(struct llist_head *lhead)
> > > +{
> > > +        return lhead->first == &llist_last;
> > > +}
> > > +
> > > +static void init_blkcg_llists(struct blkcg *blkcg)
> > > +{
> > > +        int cpu;
> > > +
> > > +        for_each_possible_cpu(cpu)
> > > +                per_cpu_ptr(blkcg->lhead, cpu)->first = &llist_last;
> > > +}
> > > +
> > > +static struct llist_node *fetch_delete_blkcg_llist(struct llist_head *lhead)
> > > +{
> > > +        return xchg(&lhead->first, &llist_last);
> > > +}
> > > +
> > > +static struct llist_node *fetch_delete_lnode_next(struct llist_node *lnode)
> > > +{
> > > +        struct llist_node *next = READ_ONCE(lnode->next);
> > > +        struct blkcg_gq *blkg = llist_entry(lnode, struct blkg_iostat_set,
> > > +                                            lnode)->blkg;
> > > +
> > > +        WRITE_ONCE(lnode->next, NULL);
> > > +        percpu_ref_put(&blkg->refcnt);
> > > +        return next;
> > > +}
> > > +
> > > +/*
> > > + * The retrieved blkg_iostat_set is immediately marked as not in the
> > > + * lockless list by clearing its node->next pointer. It could be put
> > > + * back into the list by a parallel update before the iostat's are
> > > + * finally flushed including probably the new update.
> > > + */
> > > +#define blkcg_llist_for_each_entry_safe(pos, node, nxt)                \
> > > +        for (; (node != &llist_last) &&                                 \
> > > +               (pos = llist_entry(node, struct blkg_iostat_set, lnode), \
> > > +                nxt = fetch_delete_lnode_next(node), true);             \
> > > +                node = nxt)
> > > +
> > >  /**
> > >   * blkcg_css - find the current css
> > >   *
> > > @@ -236,8 +301,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
> > >  	blkg->blkcg = blkcg;
> > >  	u64_stats_init(&blkg->iostat.sync);
> > > -	for_each_possible_cpu(cpu)
> > > +	for_each_possible_cpu(cpu) {
> > >  		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
> > > +		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
> > > +	}
> > >
> > >  	for (i = 0; i < BLKCG_MAX_POLS; i++) {
> > >  		struct blkcg_policy *pol = blkcg_policy[i];
> > > @@ -852,17 +919,23 @@ static void blkg_iostat_sub(struct blkg_iostat *dst, struct blkg_iostat *src)
> > >  static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > >  {
> > >  	struct blkcg *blkcg = css_to_blkcg(css);
> > > -	struct blkcg_gq *blkg;
> > > +	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > > +	struct llist_node *lnode, *lnext;
> > > +	struct blkg_iostat_set *bisc;
> > >
> > >  	/* Root-level stats are sourced from system-wide IO stats */
> > >  	if (!cgroup_parent(css->cgroup))
> > >  		return;
> > >
> > > +	if (blkcg_llist_empty(lhead))
> > > +		return;
> > > +
> > >  	rcu_read_lock();
> > >
> > > -	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> > > +	lnode = fetch_delete_blkcg_llist(lhead);
> > > +	blkcg_llist_for_each_entry_safe(bisc, lnode, lnext) {
> > > +		struct blkcg_gq *blkg = bisc->blkg;
> > >  		struct blkcg_gq *parent = blkg->parent;
> > > -		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
> > >  		struct blkg_iostat cur, delta;
> > >  		unsigned long flags;
> > >  		unsigned int seq;
> > > @@ -1170,6 +1243,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
> > >
> > >  	mutex_unlock(&blkcg_pol_mutex);
> > >
> > > +	free_percpu(blkcg->lhead);
> > >  	kfree(blkcg);
> > >  }
> > >
> > > @@ -1189,6 +1263,11 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
> > >  			goto unlock;
> > >  	}
> > >
> > > +	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
> > > +	if (!blkcg->lhead)
> > > +		goto free_blkcg;
> > > +	init_blkcg_llists(blkcg);
> > > +
> > >  	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
> > >  		struct blkcg_policy *pol = blkcg_policy[i];
> > >  		struct blkcg_policy_data *cpd;
> > > @@ -1229,7 +1308,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
> > >  	for (i--; i >= 0; i--)
> > >  		if (blkcg->cpd[i])
> > >  			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
> > > -
> > > +	free_percpu(blkcg->lhead);
> > > +free_blkcg:
> > >  	if (blkcg != &blkcg_root)
> > >  		kfree(blkcg);
> > >  unlock:
> > > @@ -1993,6 +2073,7 @@ static int blk_cgroup_io_type(struct bio *bio)
> > >
> > >  void blk_cgroup_bio_start(struct bio *bio)
> > >  {
> > > +	struct blkcg *blkcg = bio->bi_blkg->blkcg;
> > >  	int rwd = blk_cgroup_io_type(bio), cpu;
> > >  	struct blkg_iostat_set *bis;
> > >  	unsigned long flags;
> > > @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
> > >  	}
> > >  	bis->cur.ios[rwd]++;
> > >
> > > +	if (!READ_ONCE(bis->lnode.next)) {
> > > +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > > +
> > > +		llist_add(&bis->lnode, lhead);
> > > +		percpu_ref_get(&bis->blkg->refcnt);
> > > +	}
> >
> > The above still adds cost in fast io path.
>
> That is true, but it depends on how often is cgroup_rstat_flush*() is
> called.
> There is a one time setup cost after a flush. Subsequent IO ops on
> the same device and cpu will have negligible cost.

OK.

> > > +
> > >  	u64_stats_update_end_irqrestore(&bis->sync, flags);
> > >  	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> > > -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> > > +		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
> > >  	put_cpu();
> > >  }
> >
> > IMO, it seems one cgroup generic issue. More importantly, the percpu
> > lock of cgroup_rstat_cpu_lock is held in both cgroup_rstat_updated()
> > and cgroup_rstat_flush_locked(), which can provide enough sync with
> > zero extra cost, meantime other cgroups can benefit from this kind of
> > much simpler improvement.
> >
> > So what do you think of the following approach?
> >
> > BTW, the cpumask can be replaced with one plain percpu variable for avoiding
> > cache conflict.
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 23ec30f50cca..f8287fced726 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -858,6 +858,11 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >  	if (!cgroup_parent(css->cgroup))
> >  		return;
> >
> > +	if (!cpumask_test_cpu(cpu, blkcg->iostat_cpumask))
> > +		return;
> > +
> > +	cpumask_clear_cpu(cpu, blkcg->iostat_cpumask);
> > +
> >  	rcu_read_lock();
> >
> >  	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> > @@ -1170,6 +1175,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)
> >
> >  	mutex_unlock(&blkcg_pol_mutex);
> >
> > +	free_cpumask_var(blkcg->iostat_cpumask);
> >  	kfree(blkcg);
> >  }
> >
> > @@ -1213,6 +1219,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
> >  		pol->cpd_init_fn(cpd);
> >  	}
> >
> > +	if (!zalloc_cpumask_var(&blkcg->iostat_cpumask, GFP_KERNEL))
> > +		goto free_pd_blkcg;
> > +
> >  	spin_lock_init(&blkcg->lock);
> >  	refcount_set(&blkcg->online_pin, 1);
> >  	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_NOWAIT | __GFP_NOWARN);
> > @@ -2009,7 +2018,8 @@ void blk_cgroup_bio_start(struct bio *bio)
> >
> >  	u64_stats_update_end_irqrestore(&bis->sync, flags);
> >  	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
> > -		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
> > +		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu,
> > +				bio->bi_blkg->blkcg->iostat_cpumask);
> >  	put_cpu();
> >  }
> >
> > diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
> > index d4de0a35e066..458b40ca045a 100644
> > --- a/block/blk-cgroup.h
> > +++ b/block/blk-cgroup.h
> > @@ -103,6 +103,7 @@ struct blkcg {
> >  #ifdef CONFIG_CGROUP_WRITEBACK
> >  	struct list_head		cgwb_list;
> >  #endif
> > +	cpumask_var_t			iostat_cpumask;
> >  };
> >
> >  static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 0d1ada8968d7..4fa5dde3a62c 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -763,7 +763,7 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
> >  /*
> >   * cgroup scalable recursive statistics.
> >   */
> > -void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
> > +void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask);
> >  void cgroup_rstat_flush(struct cgroup *cgrp);
> >  void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
> >  void cgroup_rstat_flush_hold(struct cgroup *cgrp);
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index 24b5c2ab5598..f4eb63b86e56 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -22,7 +22,7 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
> >   * rstat_cpu->updated_children list. See the comment on top of
> >   * cgroup_rstat_cpu definition for details.
> >   */
> > -void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
> > +void cgroup_rstat_updated(struct cgroup *cgrp, int cpu, cpumask_var_t cpumask)
> >  {
> >  	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> >  	unsigned long flags;
> > @@ -40,6 +40,9 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
> >
> >  	raw_spin_lock_irqsave(cpu_lock, flags);
> >
> > +	if (cpumask)
> > +		cpumask_set_cpu(cpu, cpumask);
> > +
> >  	/* put @cgrp and all ancestors on the corresponding updated lists */
> >  	while (true) {
> >  		struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> > @@ -366,7 +369,7 @@ static void cgroup_base_stat_cputime_account_end(struct cgroup *cgrp,
> >  						 unsigned long flags)
> >  {
> >  	u64_stats_update_end_irqrestore(&rstatc->bsync, flags);
> > -	cgroup_rstat_updated(cgrp, smp_processor_id());
> > +	cgroup_rstat_updated(cgrp, smp_processor_id(), NULL);
> >  	put_cpu_ptr(rstatc);
> >  }
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index abec50f31fe6..8c4f204dbf5b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -622,7 +622,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >  {
> >  	unsigned int x;
> >
> > -	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
> > +	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id(), NULL);
> >  	x = __this_cpu_add_return(stats_updates, abs(val));
> >  	if (x > MEMCG_CHARGE_BATCH) {
>
> I think the rstat set of functions are doing that already. So flush will
> only call CPUs that have called cgroup_rstat_updated() before. However, one

Yeah, I guess the detail is in cgroup_rstat_cpu_pop_updated(), but the
percpu lock (raw_spin_lock_irqsave) is still required, and
cgroup_rstat_cpu_pop_updated() is still called even though there isn't
any update on this CPU.

> deficiency that I am aware of is that there is no bitmap of which controller
> have update. The problem that I saw in cgroup v2 is that in a cgroup with
> both memory controller and block controller enabled, a
> cgroup_rstat_updated() call from memory cgroup later causes the rstat
> function to call into block cgroup flush method even though there is no
> update in the block controller. This is an area that needs improvement.
>
> Your code does allow the block controller to be aware of that and avoid
> further action, but I think it has to be done in the rstat code to be
> applicable to all controllers instead of just specific to block controller.

I guess it can be done by adding one percpu variable to 'struct cgroup'.
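Something like the following rough sketch, just to illustrate the idea (not
even compile-tested; the field name rstat_ss_updated and the helper
cgroup_rstat_ss_updated() are made up here): keep a percpu bitmap in
'struct cgroup' with one bit per subsystem, set the bit on the update side,
and let the flush side skip controllers whose bit is clear on that cpu.

	/* hypothetical field in struct cgroup (include/linux/cgroup-defs.h),
	 * one bit per subsys id; assumes CGROUP_SUBSYS_COUNT <= BITS_PER_LONG
	 */
	unsigned long __percpu *rstat_ss_updated;

	/* update side: the controller passes its subsys id with the cpu */
	static inline void cgroup_rstat_ss_updated(struct cgroup *cgrp,
						   int cpu, int ssid)
	{
		set_bit(ssid, per_cpu_ptr(cgrp->rstat_ss_updated, cpu));
		cgroup_rstat_updated(cgrp, cpu);
	}

	/* flush side, in the per-cpu loop of cgroup_rstat_flush_locked():
	 * only call a controller's flush method if it updated on @cpu
	 */
	list_for_each_entry_rcu(css, &pos->rstat_css_list, rstat_css_node)
		if (test_and_clear_bit(css->ss->id,
				       per_cpu_ptr(pos->rstat_ss_updated, cpu)))
			css->ss->css_rstat_flush(css, cpu);

Then each controller only pays for its own updates, instead of every
controller's flush method being called whenever any controller updates
stats for that cgroup on that cpu.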
> There is another problem that this approach. Suppose the system have 20
> block devices and one of them has an IO operation. Now the flush method
> still needs to iterate all the 20 blkg's to do an update. The block
> controller is kind of special that the number of per-cgroup IO stats depends
> on the number of block devices present. Other controllers just have one set
> of stats per cgroup.

Yeah, and this one is really a blkio-specific issue, and your patch does
cover it. Maybe you can add one callback to cgroup_rstat_updated(), so that
the "blkg_iostat_set" instance is added to the percpu list under the percpu
lock of cgroup_rstat_cpu_lock; then the lockless list isn't needed.

Thanks,
Ming