Hi Waiman,
I've noticed that on recent kernels io.stat metrics don't propagate
all the way up the hierarchy. Specifically, io.stat metrics of some
leaf cgroup will be propagated to the parent, but not its grandparent.
For a simple repro, run the following:

    systemd-run --slice test-test dd if=/dev/urandom of=/tmp/test bs=4096 count=1

Then:

    cat /sys/fs/cgroup/test.slice/test-test.slice/io.stat

shows the parent cgroup stats, where I see wbytes=4096, but the grandparent cgroup:

    cat /sys/fs/cgroup/test.slice/io.stat

shows no writes.
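
One way to compare the two files without eyeballing them is to total the wbytes field across devices. The sketch below is only an illustration: parse_io_stat, total_wbytes and read_total_wbytes are hypothetical helper names, and it assumes the usual "MAJ:MIN key=value ..." line layout of io.stat.

```python
from pathlib import Path


def parse_io_stat(text):
    """Parse io.stat text into {device: {key: value}} (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines; an empty file yields an empty dict
        device, pairs = fields[0], fields[1:]
        entry = {}
        for pair in pairs:
            key, sep, value = pair.partition("=")
            if not sep:
                continue  # not a key=value token
            # Counters are integers; keep anything else as a string.
            entry[key] = int(value) if value.isdigit() else value
        stats[device] = entry
    return stats


def total_wbytes(text):
    """Sum the wbytes counter over every device listed in io.stat text."""
    total = 0
    for entry in parse_io_stat(text).values():
        value = entry.get("wbytes", 0)
        if isinstance(value, int):
            total += value
    return total


def read_total_wbytes(path):
    """Read an io.stat file and return its total wbytes."""
    return total_wbytes(Path(path).read_text())


# Intended use with the paths from the repro (the cgroups must exist):
#   parent = read_total_wbytes("/sys/fs/cgroup/test.slice/test-test.slice/io.stat")
#   grandparent = read_total_wbytes("/sys/fs/cgroup/test.slice/io.stat")
#   if parent > grandparent: the grandparent is missing writes seen by the parent
```

Since io.stat is hierarchical, the grandparent total should never be lower than the parent total, so a smaller value there is the symptom described here.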
I believe this was caused by the change in "blk-cgroup: Optimize
blkcg_rstat_flush()". When blkcg_rstat_flush() is called on the parent
cgroup, it exits early because the lockless list is empty: the parent
cgroup never issued writes itself (e.g. via blk_cgroup_bio_start()).
As a result, the stats it received from its child are never propagated
on to its own parent.
Can you confirm if my understanding of the logic here is correct and
advise on a fix?
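
The early exit described above can be mimicked with a toy model. This is a simplified sketch and not the kernel code: the Cgroup class, its fields, the helper names and the early_exit switch are all invented for illustration, and a plain boolean stands in for membership on the lockless list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Toy stand-in for a cgroup's I/O stats (all fields are hypothetical)."""
    name: str
    parent: Optional["Cgroup"] = None
    pending: int = 0      # bytes written directly here, not yet flushed
    cur: int = 0          # flushed total, i.e. what io.stat would report
    last: int = 0         # part of cur already handed to the parent
    queued: bool = False  # stands in for "is on the lockless list"


def issue_write(cg, nbytes):
    """Model a write issued directly by this cgroup."""
    cg.pending += nbytes
    cg.queued = True


def flush_one(cg, early_exit):
    """Flush one cgroup and pass its unpropagated delta to its parent."""
    if early_exit and not cg.queued:
        # Nothing was queued locally, so return immediately. This also
        # skips handing stats received from children to the parent.
        return
    cg.cur += cg.pending
    cg.pending = 0
    cg.queued = False
    if cg.parent is not None:
        cg.parent.cur += cg.cur - cg.last
        cg.last = cg.cur


def run(early_exit):
    """Write 4096 bytes in the leaf, flush bottom-up, return reported totals."""
    grandparent = Cgroup("grandparent")
    parent = Cgroup("parent", parent=grandparent)
    leaf = Cgroup("leaf", parent=parent)
    issue_write(leaf, 4096)
    for cg in (leaf, parent, grandparent):
        flush_one(cg, early_exit)
    return {cg.name: cg.cur for cg in (leaf, parent, grandparent)}
```

In this model, run(early_exit=True) leaves the grandparent at 0 because the parent never queued anything itself and is skipped, which matches the repro. With run(early_exit=False) every level is visited and all three report 4096.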
On 5/14/24 15:25, Dan Schatzberg wrote:
> I believe this was caused by the change in "blk-cgroup: Optimize
> blkcg_rstat_flush()". When blkcg_rstat_flush is called on the parent
> cgroup, it exits early because the lockless list is empty since the
> parent cgroup never issued writes itself (e.g. in
> blk_cgroup_bio_start). However, in doing so it never propagated stats
> to its parent.
>
> Can you confirm if my understanding of the logic here is correct and
> advise on a fix?
Yes, I believe your analysis is correct. Thanks for spotting this iostat
propagation problem.
I am working on a fix to address this problem and will post a patch once
I have finished my testing.
Thanks,
Longman
On 5/14/24 23:59, Waiman Long wrote:
> Yes, I believe your analysis is correct. Thanks for spotting this
> iostat propagation problem.
>
> I am working on a fix to address this problem and will post a patch
> once I have finished my testing.
Actually, I can only reproduce the issue with a 3-level
(child-parent-grandparent) cgroup hierarchy below the root cgroup. The
dd command is run in test.slice/test-test.slice. So both test.slice/io.stat
and test.slice/test-test.slice/io.stat are properly updated.
Cheers,
Longman
On Wed, May 15, 2024 at 10:26:31AM -0400, Waiman Long wrote:
> Actually, I can only reproduce the issue with a 3-level
> (child-parent-grandparent) cgroup hierarchy below the root cgroup. The dd
> command is run in test.slice/test-test.slice. So both test.slice/io.stat and
> test.slice/test-test.slice/io.stat are properly updated.
That's correct, this repros with a 3-level cgroup hierarchy (or
more). systemd-run should create an ephemeral .scope cgroup under
test-test.slice and then delete it when the dd command finishes. So
test.slice/test-test.slice was the parent (2nd level) and test.slice
is the grandparent.
On 5/15/24 12:54, Dan Schatzberg wrote:
> That's correct, this repros with a 3-level cgroup hierarchy (or
> more). systemd-run should create an ephemeral .scope cgroup under
> test-test.slice and then delete it when the dd command finishes. So
> test.slice/test-test.slice was the parent (2nd level) and test.slice
> is the grandparent.
OK, I didn't get the .scope sub-cgroup when I ran the above systemd-run
command. Perhaps it's due to a difference in configuration. Anyway, I
was able to reproduce the problem and devise the fix accordingly.
Thanks,
Longman