by Michal Koutný


Subject: Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()

On Wed, Jun 08, 2022 at 11:12:55PM +0200, Michal Koutný <[email protected]> wrote:
> Wouldn't that mean submitting a bio from offlined blkcg?
> blkg_tryget_closest() should prevent that.

Self-correction: no, I forgot that blkg_tryget_closest() takes any
non-zero reference, not just a live one (as percpu_ref_tryget_live()
would). Furthermore, I can see that an offlined blkcg may still issue
writeback bios, for instance.
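
The difference between the two tryget flavours can be sketched in
userspace C. This is only a toy model: `toy_ref` and its helpers are
hypothetical names, not the kernel's percpu_ref, and the check in
toy_ref_tryget_live() is not race-free the way the real one is. It only
illustrates why a "non-zero" tryget still succeeds after the owner has
been killed (offlined).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a percpu_ref: a counter plus a "killed" flag. */
struct toy_ref {
	atomic_long count;
	atomic_bool dead;	/* set when the owner goes offline */
};

static void toy_ref_init(struct toy_ref *r)
{
	atomic_init(&r->count, 1);	/* initial reference held by the owner */
	atomic_init(&r->dead, false);
}

/* Mark the ref as dying; existing references stay valid. */
static void toy_ref_kill(struct toy_ref *r)
{
	atomic_store(&r->dead, true);
}

/* Succeeds as long as the count has not reached zero, even after kill. */
static bool toy_ref_tryget(struct toy_ref *r)
{
	long old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Succeeds only while the ref has not been killed (simplified, racy). */
static bool toy_ref_tryget_live(struct toy_ref *r)
{
	if (atomic_load(&r->dead))
		return false;
	return toy_ref_tryget(r);
}

static void toy_ref_put(struct toy_ref *r)
{
	atomic_fetch_sub(&r->count, 1);
}
```

In this model, a caller using toy_ref_tryget() can still obtain a
reference after toy_ref_kill(), which mirrors how a bio can end up
being submitted against an offlined blkcg.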

> > I guess one possible solution may be to abandon the llist and revert
> > back to list iteration when offline. I need to think a bit more about
> > that.

Since blkcg stats are only used for io.stat of an online blkcg, the
update may be skipped on an offlined blkcg. (Which of course breaks when
something starts to depend on the stats of an offlined blkcg.)

Michal

2022-09-30 18:54:13

by Waiman Long


Subject: Re: [PATCH v6 3/3] blk-cgroup: Optimize blkcg_rstat_flush()

On 6/8/22 12:57, Michal Koutný wrote:
>> @@ -2011,9 +2092,16 @@ void blk_cgroup_bio_start(struct bio *bio)
>>  	}
>>  	bis->cur.ios[rwd]++;
>>
>> +	if (!READ_ONCE(bis->lnode.next)) {
>> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +
>> +		llist_add(&bis->lnode, lhead);
>> +		percpu_ref_get(&bis->blkg->refcnt);
>> +	}
>> +
> When a blkg's cgroup is rmdir'd, what happens with the lhead list?
> We have cgroup_rstat_exit() in css_free_rwork_fn() that ultimately flushes rstats.
> init_and_link_css however adds reference from blkcg->css to cgroup->css.
> The blkcg->css would be (transitively) pinned by the lhead list and
> hence would prevent the final flush (when refs drop to zero). Seems like
> a cyclic dependency.

That is not true. The percpu lhead list is embedded in blkcg, but it does
not pin blkcg. What the code does is pin the blkg so that it is not freed
while it is on the lockless list. I do need to move the percpu_ref_put()
in blkcg_rstat_flush() later to avoid a use-after-free, though.
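
The pin-while-queued idea, and why the put belongs at the very end of
the per-node work, can be sketched in userspace C. This is a minimal
model under stated assumptions: `toy_blkg`, `toy_stat`, toy_queue() and
toy_flush() are hypothetical names, the list is a plain atomic pointer
rather than the kernel llist, and "freeing" is only recorded in a flag.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy owner object; "freed" is set when the last reference is dropped. */
struct toy_blkg {
	atomic_int refcnt;
	int freed;
	long total_ios;
};

/* Toy per-CPU stat node that lives inside (and dies with) its owner. */
struct toy_stat {
	struct toy_stat *next;	/* link in the pending list */
	struct toy_blkg *blkg;
	long cur_ios;
};

static void toy_blkg_put(struct toy_blkg *g)
{
	if (atomic_fetch_sub(&g->refcnt, 1) == 1)
		g->freed = 1;	/* a real implementation would free here */
}

/* Queue a stat node for flushing and pin its owner while it is queued. */
static void toy_queue(_Atomic(struct toy_stat *) *head, struct toy_stat *s)
{
	struct toy_stat *first = atomic_load(head);

	do {
		s->next = first;
	} while (!atomic_compare_exchange_weak(head, &first, s));
	atomic_fetch_add(&s->blkg->refcnt, 1);
}

/* Detach the whole list and fold each node's counters into its owner. */
static void toy_flush(_Atomic(struct toy_stat *) *head)
{
	struct toy_stat *s = atomic_exchange(head, NULL);

	while (s) {
		struct toy_stat *next = s->next;	/* read before the put */
		struct toy_blkg *g = s->blkg;

		s->next = NULL;
		g->total_ios += s->cur_ios;
		s->cur_ios = 0;
		/* Drop the pin last: nothing may touch s or g afterwards. */
		toy_blkg_put(g);
		s = next;
	}
}
```

If the put came before the counters were folded in, the owner (and the
node embedded in it) could already be gone when they are accessed, which
is the use-after-free being avoided.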

>
> Luckily, there's also per-subsys flushing in css_release which could be
> moved after rmdir (offlining) but before last ref is gone:
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index adb820e98f24..d830e6a8fb3b 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5165,11 +5165,6 @@ static void css_release_work_fn(struct work_struct *work)
>
>  	if (ss) {
>  		/* css release path */
> -		if (!list_empty(&css->rstat_css_node)) {
> -			cgroup_rstat_flush(cgrp);
> -			list_del_rcu(&css->rstat_css_node);
> -		}
> -
>  		cgroup_idr_replace(&ss->css_idr, NULL, css->id);
>  		if (ss->css_released)
>  			ss->css_released(css);
> @@ -5279,6 +5274,11 @@ static void offline_css(struct cgroup_subsys_state *css)
>  	css->flags &= ~CSS_ONLINE;
>  	RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
>
> +	if (!list_empty(&css->rstat_css_node)) {
> +		cgroup_rstat_flush(css->cgrp);
> +		list_del_rcu(&css->rstat_css_node);
> +	}
> +
>  	wake_up_all(&css->cgroup->offline_waitq);
>  }
>
> (not tested)

I don't think that code is necessary. Anyway, I am planning to make a
parallel set of helpers for a lockless list variant with a sentinel, as
suggested.
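
A sentinel-terminated lockless list could look roughly like the
userspace sketch below. All names here (`snode`, `shead`, SLIST_END,
slist_add(), slist_del_all()) are hypothetical and are not the helpers
from the actual patch. The point is that the list ends in a non-NULL
marker, so "next != NULL" reliably means "on a list", even for the last
node. The sketch assumes each node is only ever added by a single owner
(e.g. its own CPU), so the on-list check and the add need not be atomic
with respect to each other.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct snode {
	struct snode *next;	/* NULL means "not on any list" */
};

/* Unique non-NULL end-of-list marker shared by all lists. */
static struct snode slist_end;
#define SLIST_END (&slist_end)

struct shead {
	_Atomic(struct snode *) first;
};

static void slist_init(struct shead *h)
{
	atomic_init(&h->first, SLIST_END);
}

static bool snode_on_list(const struct snode *n)
{
	return n->next != NULL;
}

/* Add n unless it is already queued; returns true if it was added. */
static bool slist_add(struct shead *h, struct snode *n)
{
	struct snode *first;

	if (snode_on_list(n))
		return false;

	first = atomic_load(&h->first);
	do {
		n->next = first;	/* never NULL: at worst SLIST_END */
	} while (!atomic_compare_exchange_weak(&h->first, &first, n));
	return true;
}

/* Detach all nodes; the caller walks them until it hits SLIST_END. */
static struct snode *slist_del_all(struct shead *h)
{
	return atomic_exchange(&h->first, SLIST_END);
}
```

A consumer would walk the detached chain until it reaches SLIST_END,
resetting each node's next pointer to NULL so the node can be queued
again.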

Thanks,
Longman