LinuxLists.cc - [PATCH v2 2/4] mm: correct calculation of wb's bg

2024-04-25 13:27:43

Subject: [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain

wb_calc_thresh() is supposed to calculate wb's share of bg_thresh in the
global domain. To calculate wb's share of bg_thresh in a cgroup domain,
it is more reasonable to use __wb_calc_thresh(), which is how we calculate
dirty_thresh in the cgroup domain in balance_dirty_pages().

Consider the following domain hierarchy:
                global domain (> 20G)
                /                 \
  cgroup domain1(10G)       cgroup domain2(10G)
          |                         |
bdi      wb1                       wb2
Assume wb1 and wb2 have the same bandwidth.
We have global domain bg_thresh > 2G and cgroup domain bg_thresh 1G.
Then we have:
wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
                             = 2G * 1/2 = 1G
wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
                             = 1G * 1/2 = 0.5G
As a result, wb1 and wb2 are each limited to 0.5G and the system as a whole
is limited to 1G, which is less than the global domain bg_thresh of 2G.
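
The arithmetic above can be sketched as a small standalone helper. This is
only an illustration of the proportional-share formula, not kernel code:
wb_share_of_thresh() is a hypothetical name, sizes are in MiB, and the
bandwidth numbers are arbitrary placeholders.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): scale a domain threshold
 * by a wb's share of the bandwidth it is compared against.
 *   thresh   - domain bg_thresh in MiB
 *   wb_bw    - bandwidth of this wb (arbitrary units)
 *   total_bw - bandwidth of the domain the share is computed in
 */
static uint64_t wb_share_of_thresh(uint64_t thresh, uint64_t wb_bw,
				   uint64_t total_bw)
{
	if (total_bw == 0)
		return 0;	/* no bandwidth recorded, no share */
	return thresh * wb_bw / total_bw;
}
```

With wb_bw = 100 and a system bandwidth of 200, a 2048 MiB global bg_thresh
yields 1024 MiB and a 1024 MiB cgroup bg_thresh yields 512 MiB, matching the
1G and 0.5G figures above. If the share is instead taken against the cgroup's
own bandwidth (total_bw = 100), the same 1024 MiB bg_thresh stays at 1024 MiB.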

Test as follows:
/* make it easier to observe the issue */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs

/* run fio in wb1 */
cd /sys/fs/cgroup
echo "+memory +io" > cgroup.subtree_control
mkdir group1
cd group1
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0

/* run fio in wb2 with a new shell */
cd /sys/fs/cgroup
mkdir group2
cd group2
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdc
mount /dev/vdc /bdi2/
fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0

Before the fix, the written pages of wb1 and wb2 reported by
tools/writeback/wb_monitor.py keep growing. After the fix, written pages
rarely accumulate.
There is no obvious change in the fio results.

Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
Signed-off-by: Kemeng Shi <[email protected]>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2a3b68aae336..14893b20d38c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2137,7 +2137,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		if (mdtc->dirty > mdtc->bg_thresh)
 			return true;
 
-		thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
+		thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
 		if (thresh < 2 * wb_stat_error())
 			reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
 		else
--
2.30.0

2024-05-03 09:31:16

by Jan Kara

Subject: Re: [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain

On Thu 25-04-24 21:17:22, Kemeng Shi wrote:
> The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in
> global domain. To calculate wb's share of bg_thresh in cgroup domain,
> it's more reasonable to use __wb_calc_thresh in which way we calculate
> dirty_thresh in cgroup domain in balance_dirty_pages().
>
> Consider following domain hierarchy:
> global domain (> 20G)
> / \
> cgroup domain1(10G) cgroup domain2(10G)
> | |
> bdi wb1 wb2
> Assume wb1 and wb2 has the same bandwidth.
> We have global domain bg_thresh > 2G, cgroup domain bg_thresh 1G.
> Then we have:
> wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
> = 2G * 1/2 = 1G
> wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
> = 1G * 1/2 = 0.5G
> At last, wb1 and wb2 will be limited at 0.5G, the system will be limited
> at 1G which is less than global domain bg_thresh 2G.

This was a bit hard to understand for me so I'd rephrase it as:

wb_calc_thresh() is calculating wb's share of bg_thresh in the global
domain. However in case of cgroup writeback this is not the right thing to
do. Consider the following domain hierarchy:

                global domain (> 20G)
                /                 \
        cgroup1 (10G)         cgroup2 (10G)
              |                     |
bdi          wb1                   wb2

and assume wb1 and wb2 have the same bandwidth and the background threshold
is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now
because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb
threshold in the global domain as (wb bandwidth) / (domain bandwidth) it
returns bg_thresh for wb1 as 0.5G although it has nobody to compete against
in cgroup1.

Fix the problem by calculating wb's share of bg_thresh in the cgroup
domain.
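
To make the effect on background writeback concrete, here is a toy model of
the cgroup-domain check. It is a sketch under stated assumptions, not the
kernel function: toy_over_bg_thresh() and its parameters are hypothetical,
sizes are in MiB, and the final comparison of the wb's reclaimable pages
against its threshold is assumed from the context of wb_over_bg_thresh()
rather than shown in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative stand-in for the cgroup branch of the check; not kernel code.
 *   domain_dirty     - dirty pages in the cgroup domain (MiB)
 *   domain_bg_thresh - bg_thresh of the cgroup domain (MiB)
 *   wb_reclaimable   - reclaimable pages of this wb (MiB)
 *   wb_thresh        - this wb's share of domain_bg_thresh (MiB)
 */
static bool toy_over_bg_thresh(uint64_t domain_dirty, uint64_t domain_bg_thresh,
			       uint64_t wb_reclaimable, uint64_t wb_thresh)
{
	if (domain_dirty > domain_bg_thresh)
		return true;	/* the whole domain is over its limit */
	return wb_reclaimable > wb_thresh;	/* this wb is over its share */
}
```

For wb1 holding 768 MiB of reclaimable pages in cgroup1 (bg_thresh 1024 MiB),
a per-wb threshold of 512 MiB makes the check return true, so background
writeback keeps running. With the per-wb threshold at 1024 MiB, as expected
when wb1 has nobody to compete against, the check returns false.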

> Test as following:
> /* make it easier to observe the issue */
> echo 300000 > /proc/sys/vm/dirty_expire_centisecs
> echo 100 > /proc/sys/vm/dirty_writeback_centisecs
>
> /* run fio in wb1 */
> cd /sys/fs/cgroup
> echo "+memory +io" > cgroup.subtree_control
> mkdir group1
> cd group1
> echo 10G > memory.high
> echo 10G > memory.max
> echo $$ > cgroup.procs
> mkfs.ext4 -F /dev/vdb
> mount /dev/vdb /bdi1/
> fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
> -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
>
> /* run fio in wb2 with a new shell */
> cd /sys/fs/cgroup
> mkdir group2
> cd group2
> echo 10G > memory.high
> echo 10G > memory.max
> echo $$ > cgroup.procs
> mkfs.ext4 -F /dev/vdc
> mount /dev/vdc /bdi2/
> fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
> -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
>
> Before fix, the written pages of wb1 and wb2 reported from
> tools/writeback/wb_monitor.py keep growing. After fix, rare written pages
> are accumulated.
> There is no obvious change in fio result.
>
> Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
> Signed-off-by: Kemeng Shi <[email protected]>

Besides the changelog rephrasing the change looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> mm/page-writeback.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2a3b68aae336..14893b20d38c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2137,7 +2137,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
> if (mdtc->dirty > mdtc->bg_thresh)
> return true;
>
> - thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
> + thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
> if (thresh < 2 * wb_stat_error())
> reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
> else
> --
> 2.30.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-05-07 01:17:17

by Kemeng Shi

Subject: Re: [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain

Hi Jan,
on 5/3/2024 5:30 PM, Jan Kara wrote:
> On Thu 25-04-24 21:17:22, Kemeng Shi wrote:
>> The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in
>> global domain. To calculate wb's share of bg_thresh in cgroup domain,
>> it's more reasonable to use __wb_calc_thresh in which way we calculate
>> dirty_thresh in cgroup domain in balance_dirty_pages().
>>
>> Consider following domain hierarchy:
>> global domain (> 20G)
>> / \
>> cgroup domain1(10G) cgroup domain2(10G)
>> | |
>> bdi wb1 wb2
>> Assume wb1 and wb2 has the same bandwidth.
>> We have global domain bg_thresh > 2G, cgroup domain bg_thresh 1G.
>> Then we have:
>> wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
>> = 2G * 1/2 = 1G
>> wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
>> = 1G * 1/2 = 0.5G
>> At last, wb1 and wb2 will be limited at 0.5G, the system will be limited
>> at 1G which is less than global domain bg_thresh 2G.
>
> This was a bit hard to understand for me so I'd rephrase it as:
>
> wb_calc_thresh() is calculating wb's share of bg_thresh in the global
> domain. However in case of cgroup writeback this is not the right thing to
> do. Consider the following domain hierarchy:
>
> global domain (> 20G)
> / \
> cgroup1 (10G) cgroup2 (10G)
> | |
> bdi wb1 wb2
>
> and assume wb1 and wb2 have the same bandwidth and the background threshold
> is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now
> because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb
> threshold in the global domain as (wb bandwidth) / (domain bandwidth) it
> returns bg_thresh for wb1 as 0.5G although it has nobody to compete against
> in cgroup1.
>
> Fix the problem by calculating wb's share of bg_thresh in the cgroup
> domain.
Thanks for improving the changelog. As this was merged into the -mm and
mm-unstable trees, I'm not sure if a new patch is needed. If there is
anything I should do, please let me know. Thanks.

>
>> Test as following:
>> /* make it easier to observe the issue */
>> echo 300000 > /proc/sys/vm/dirty_expire_centisecs
>> echo 100 > /proc/sys/vm/dirty_writeback_centisecs
>>
>> /* run fio in wb1 */
>> cd /sys/fs/cgroup
>> echo "+memory +io" > cgroup.subtree_control
>> mkdir group1
>> cd group1
>> echo 10G > memory.high
>> echo 10G > memory.max
>> echo $$ > cgroup.procs
>> mkfs.ext4 -F /dev/vdb
>> mount /dev/vdb /bdi1/
>> fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
>> -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
>>
>> /* run fio in wb2 with a new shell */
>> cd /sys/fs/cgroup
>> mkdir group2
>> cd group2
>> echo 10G > memory.high
>> echo 10G > memory.max
>> echo $$ > cgroup.procs
>> mkfs.ext4 -F /dev/vdc
>> mount /dev/vdc /bdi2/
>> fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
>> -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
>>
>> Before fix, the written pages of wb1 and wb2 reported from
>> tools/writeback/wb_monitor.py keep growing. After fix, rare written pages
>> are accumulated.
>> There is no obvious change in fio result.
>>
>> Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
>> Signed-off-by: Kemeng Shi <[email protected]>
>
> Besides the changelog rephrasing the change looks good. Feel free to add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> Honza
>
>> ---
>> mm/page-writeback.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 2a3b68aae336..14893b20d38c 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -2137,7 +2137,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>> if (mdtc->dirty > mdtc->bg_thresh)
>> return true;
>>
>> - thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
>> + thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
>> if (thresh < 2 * wb_stat_error())
>> reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
>> else
>> --
>> 2.30.0
>>

2024-05-07 13:37:26

by Jan Kara

Subject: Re: [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain

On Tue 07-05-24 09:16:39, Kemeng Shi wrote:
>
> Hi Jan,
> on 5/3/2024 5:30 PM, Jan Kara wrote:
> > On Thu 25-04-24 21:17:22, Kemeng Shi wrote:
> >> The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in
> >> global domain. To calculate wb's share of bg_thresh in cgroup domain,
> >> it's more reasonable to use __wb_calc_thresh in which way we calculate
> >> dirty_thresh in cgroup domain in balance_dirty_pages().
> >>
> >> Consider following domain hierarchy:
> >> global domain (> 20G)
> >> / \
> >> cgroup domain1(10G) cgroup domain2(10G)
> >> | |
> >> bdi wb1 wb2
> >> Assume wb1 and wb2 has the same bandwidth.
> >> We have global domain bg_thresh > 2G, cgroup domain bg_thresh 1G.
> >> Then we have:
> >> wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
> >> = 2G * 1/2 = 1G
> >> wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
> >> = 1G * 1/2 = 0.5G
> >> At last, wb1 and wb2 will be limited at 0.5G, the system will be limited
> >> at 1G which is less than global domain bg_thresh 2G.
> >
> > This was a bit hard to understand for me so I'd rephrase it as:
> >
> > wb_calc_thresh() is calculating wb's share of bg_thresh in the global
> > domain. However in case of cgroup writeback this is not the right thing to
> > do. Consider the following domain hierarchy:
> >
> > global domain (> 20G)
> > / \
> > cgroup1 (10G) cgroup2 (10G)
> > | |
> > bdi wb1 wb2
> >
> > and assume wb1 and wb2 have the same bandwidth and the background threshold
> > is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now
> > because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb
> > threshold in the global domain as (wb bandwidth) / (domain bandwidth) it
> > returns bg_thresh for wb1 as 0.5G although it has nobody to compete against
> > in cgroup1.
> >
> > Fix the problem by calculating wb's share of bg_thresh in the cgroup
> > domain.
> Thanks for improving the changelog. As this was merged into -mm and
> mm-unstable tree, I'm not sure if a new patch is needed. If there is
> anything I should do, please let me know. Thanks.

No need to do anything here. Andrew has picked up these updates.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR