LinuxLists.cc - [PATCH 0/5] Fix and cleanups to page-writeback

2024-01-23 10:36:05

Subject: [PATCH 0/5] Fix and cleanups to page-writeback

This series contains some random cleanups and a fix to correct
calculation of cgroup wb's bg_thresh. More details can be found
respective patches. Thanks!

Kemeng Shi (5):
mm: enable __wb_calc_thresh to calculate dirty background threshold
mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh
mm: call __wb_calc_thresh instead of wb_calc_thresh in
wb_over_bg_thresh
mm: remove redundant check in wb_min_max_ratio
mm: remove stale comment __folio_mark_dirty

mm/page-writeback.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)

--
2.30.0

2024-01-23 10:36:27

by Kemeng Shi

[permalink] [raw]

Subject: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

The wb_calc_thresh will calculate wb's share in global wb domain. We need
to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh
instead of wb_calc_thresh to fix this.

Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
Signed-off-by: Kemeng Shi <[email protected]>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9268859722c4..f6c7f3b0f495 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2118,7 +2118,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
if (mdtc->dirty > mdtc->bg_thresh)
return true;

- thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
+ thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
if (thresh < 2 * wb_stat_error())
reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
else
--
2.30.0

2024-01-23 10:36:27

by Kemeng Shi

[permalink] [raw]

Subject: [PATCH 1/5] mm: enable __wb_calc_thresh to calculate dirty background threshold

Originally, __wb_calc_thresh always calculate wb's share of dirty
throttling threshold. By getting thresh of wb_domain from caller,
__wb_calc_thresh could be used for both dirty throttling and dirty
background threshold.

This is a preparation to correct threshold calculation of wb in cgroup.

Signed-off-by: Kemeng Shi <[email protected]>
---
mm/page-writeback.c | 31 +++++++++++++++++--------------
1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cd4e4ae77c40..9268859722c4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -838,13 +838,15 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
}

/**
- * __wb_calc_thresh - @wb's share of dirty throttling threshold
+ * __wb_calc_thresh - @wb's share of dirty threshold
* @dtc: dirty_throttle_context of interest
+ * @thresh: dirty throttling or dirty background threshold of wb_domain in @dtc
*
- * Note that balance_dirty_pages() will only seriously take it as a hard limit
- * when sleeping max_pause per page is not enough to keep the dirty pages under
- * control. For example, when the device is completely stalled due to some error
- * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
+ * Note that balance_dirty_pages() will only seriously take dirty throttling
+ * threshold as a hard limit when sleeping max_pause per page is not enough
+ * to keep the dirty pages under control. For example, when the device is
+ * completely stalled due to some error conditions, or when there are 1000
+ * dd tasks writing to a slow 10MB/s USB key.
* In the other normal situations, it acts more gently by throttling the tasks
* more (rather than completely block them) when the wb dirty pages go high.
*
@@ -855,19 +857,20 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
* The wb's share of dirty limit will be adapting to its throughput and
* bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
*
- * Return: @wb's dirty limit in pages. The term "dirty" in the context of
- * dirty balancing includes all PG_dirty and PG_writeback pages.
+ * Return: @wb's dirty limit in pages. For dirty throttling limit, the term
+ * "dirty" in the context of dirty balancing includes all PG_dirty and
+ * PG_writeback pages.
*/
-static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
+static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
+ unsigned long thresh)
{
struct wb_domain *dom = dtc_dom(dtc);
- unsigned long thresh = dtc->thresh;
u64 wb_thresh;
unsigned long numerator, denominator;
unsigned long wb_min_ratio, wb_max_ratio;

/*
- * Calculate this BDI's share of the thresh ratio.
+ * Calculate this wb's share of the thresh ratio.
*/
fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
&numerator, &denominator);
@@ -887,9 +890,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)

unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
{
- struct dirty_throttle_control gdtc = { GDTC_INIT(wb),
- .thresh = thresh };
- return __wb_calc_thresh(&gdtc);
+ struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
+
+ return __wb_calc_thresh(&gdtc, thresh);
}

/*
@@ -1636,7 +1639,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
* wb_position_ratio() will let the dirtier task progress
* at some rate <= (write_bw / 2) for bringing down wb_dirty.
*/
- dtc->wb_thresh = __wb_calc_thresh(dtc);
+ dtc->wb_thresh = __wb_calc_thresh(dtc, dtc->thresh);
dtc->wb_bg_thresh = dtc->thresh ?
div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;

--
2.30.0

2024-01-23 10:36:35

by Kemeng Shi

[permalink] [raw]

Subject: [PATCH 3/5] mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_thresh

Call __wb_calc_thresh directly to remove unnecessary wrap of
wb_calc_thresh.

Signed-off-by: Kemeng Shi <[email protected]>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f6c7f3b0f495..5c19ebffe5be 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2098,7 +2098,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
if (gdtc->dirty > gdtc->bg_thresh)
return true;

- thresh = wb_calc_thresh(gdtc->wb, gdtc->bg_thresh);
+ thresh = __wb_calc_thresh(gdtc, gdtc->bg_thresh);
if (thresh < 2 * wb_stat_error())
reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
else
--
2.30.0

2024-01-23 10:36:56

by Kemeng Shi

[permalink] [raw]

Subject: [PATCH 4/5] mm: remove redundant check in wb_min_max_ratio

We initialize bdi->max_ratio with a valid value (100 * BDI_RATIO_SCALE)
in bdi_init and we always set bdi->max_ratio with a valid value (< 100 *
BDI_RATIO_SCALE) in __bdi_set_max_ratio. So the validation of max_ratio
in wb_min_max_ratio is redundant. Just remove it.

Signed-off-by: Kemeng Shi <[email protected]>
---
mm/page-writeback.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5c19ebffe5be..f25393034c76 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -198,10 +198,8 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
min *= this_bw;
min = div64_ul(min, tot_bw);
}
- if (max < 100 * BDI_RATIO_SCALE) {
- max *= this_bw;
- max = div64_ul(max, tot_bw);
- }
+ max *= this_bw;
+ max = div64_ul(max, tot_bw);
}

*minp = min;
--
2.30.0

2024-01-23 10:37:03

by Kemeng Shi

[permalink] [raw]

Subject: [PATCH 5/5] mm: remove stale comment __folio_mark_dirty

The __folio_mark_dirty will not mark inode dirty any longer. Remove the
stale comment of it.

Signed-off-by: Kemeng Shi <[email protected]>
---
mm/page-writeback.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f25393034c76..03e2797bc983 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2647,8 +2647,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
}

/*
- * Mark the folio dirty, and set it dirty in the page cache, and mark
- * the inode dirty.
+ * Mark the folio dirty, and set it dirty in the page cache.
*
* If warn is true, then emit a warning if the folio is not uptodate and has
* not been truncated.
--
2.30.0

2024-01-23 17:23:57

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH 5/5] mm: remove stale comment __folio_mark_dirty

On Wed, Jan 24, 2024 at 02:33:32AM +0800, Kemeng Shi wrote:
> The __folio_mark_dirty will not mark inode dirty any longer. Remove the
> stale comment of it.
>
> Signed-off-by: Kemeng Shi <[email protected]>

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

2024-01-23 20:43:52

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

On Wed, Jan 24, 2024 at 02:33:29AM +0800, Kemeng Shi wrote:
> The wb_calc_thresh will calculate wb's share in global wb domain. We need
> to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh
> instead of wb_calc_thresh to fix this.

That function is calculating the wb's portion of wb portion in the whole
system so that threshold can be distributed accordingly. So, it has to be
compared in the global domain. If you look at the comment on top of struct
wb_domain, it says:

/*
* A wb_domain represents a domain that wb's (bdi_writeback's) belong to
* and are measured against each other in. There always is one global
* domain, global_wb_domain, that every wb in the system is a member of.
* This allows measuring the relative bandwidth of each wb to distribute
* dirtyable memory accordingly.
*/

Also, how is this tested? Was there a case where the existing code
misbehaved that's improved by this patch? Or is this just from reading code?

Thanks.

--
tejun

2024-01-23 20:46:47

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH 0/5] Fix and cleanups to page-writeback

On Wed, Jan 24, 2024 at 02:33:27AM +0800, Kemeng Shi wrote:
> This series contains some random cleanups and a fix to correct
> calculation of cgroup wb's bg_thresh. More details can be found
> respective patches. Thanks!
>
> Kemeng Shi (5):
> mm: enable __wb_calc_thresh to calculate dirty background threshold
> mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh
> mm: call __wb_calc_thresh instead of wb_calc_thresh in
> wb_over_bg_thresh
> mm: remove redundant check in wb_min_max_ratio
> mm: remove stale comment __folio_mark_dirty

I don't have the fifth patch in my inbox and I find the patchset difficult
to review because it's changing subtle behaviors without sufficient
justifications or concrete cases which motivated the changes. For now, I
think it'd be prudent to not apply this patchset.

Thanks.

--
tejun

2024-01-24 02:02:30

by Kemeng Shi

[permalink] [raw]

Subject: Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

on 1/24/2024 4:43 AM, Tejun Heo wrote:
> On Wed, Jan 24, 2024 at 02:33:29AM +0800, Kemeng Shi wrote:
>> The wb_calc_thresh will calculate wb's share in global wb domain. We need
>> to wb's share in mem_cgroup_wb_domain for mdtc. Call __wb_calc_thresh
>> instead of wb_calc_thresh to fix this.
>
> That function is calculating the wb's portion of wb portion in the whole
> system so that threshold can be distributed accordingly. So, it has to be
> compared in the global domain. If you look at the comment on top of struct
> wb_domain, it says:
>
> /*
> * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
> * and are measured against each other in. There always is one global
> * domain, global_wb_domain, that every wb in the system is a member of.
> * This allows measuring the relative bandwidth of each wb to distribute
> * dirtyable memory accordingly.
> */
>
Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb
domain and a cgroup domain. I agree the way how we calculate wb's threshold
in global domain as you described above. This patch tries to fix calculation
of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb,
mdtc->bg_thresh)), means:
(wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*)
The cgroup domain threshold is
(memory of cgroup domain) / (memory of system) * (system threshold).
Then the wb's threshold in cgroup will be smaller than expected.

Consider following domain hierarchy:
global domain (100G)
/ \
cgroup domain1(50G) cgroup domain2(50G)
| |
bdi wb1 wb2
Assume wb1 and wb2 has the same bandwidth.
We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G.
Then we have:
wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth)
= 10G * 1/2 = 5G
wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth)
= 5G * 1/2 = 2.5G
At last, wb1 and wb2 will be limited at 2.5G, the system will be limited
at 5G which is less than global domain bg_thresh 10G.

After the fix, threshold in cgroup domain will be:
(wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold)
The wb1 and wb2 will be limited at 5G, the system will be limited at
10G which equals to global domain bg_thresh 10G.

As I didn't take a deep look into memory cgroup, please correct me if
anything is wrong. Thanks!
> Also, how is this tested? Was there a case where the existing code
> misbehaved that's improved by this patch? Or is this just from reading code?
This is jut from reading code. Would the case showed above convince you
a bit. Look forward to your reply, thanks!.
>
> Thanks.
>

2024-01-29 21:00:35

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

Hello,

On Wed, Jan 24, 2024 at 10:01:47AM +0800, Kemeng Shi wrote:
> Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb
> domain and a cgroup domain. I agree the way how we calculate wb's threshold
> in global domain as you described above. This patch tries to fix calculation
> of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb,
> mdtc->bg_thresh)), means:
> (wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*)
> The cgroup domain threshold is
> (memory of cgroup domain) / (memory of system) * (system threshold).
> Then the wb's threshold in cgroup will be smaller than expected.
>
> Consider following domain hierarchy:
> global domain (100G)
> / \
> cgroup domain1(50G) cgroup domain2(50G)
> | |
> bdi wb1 wb2
> Assume wb1 and wb2 has the same bandwidth.
> We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G.
> Then we have:
> wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth)
> = 10G * 1/2 = 5G
> wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth)
> = 5G * 1/2 = 2.5G
> At last, wb1 and wb2 will be limited at 2.5G, the system will be limited
> at 5G which is less than global domain bg_thresh 10G.
>
> After the fix, threshold in cgroup domain will be:
> (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold)
> The wb1 and wb2 will be limited at 5G, the system will be limited at
> 10G which equals to global domain bg_thresh 10G.
>
> As I didn't take a deep look into memory cgroup, please correct me if
> anything is wrong. Thanks!
> > Also, how is this tested? Was there a case where the existing code
> > misbehaved that's improved by this patch? Or is this just from reading code?
>
> This is jut from reading code. Would the case showed above convince you
> a bit. Look forward to your reply, thanks!.

So, the explanation makes some sense to me but can you please construct a
case to actually demonstrate the problem and fix? I don't think it'd be wise
to apply the change without actually observing the code change does what it
says it does.

Thanks.

--
tejun

2024-02-08 09:40:33

by Kemeng Shi

[permalink] [raw]

Subject: Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

on 1/30/2024 5:00 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Jan 24, 2024 at 10:01:47AM +0800, Kemeng Shi wrote:
>> Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb
>> domain and a cgroup domain. I agree the way how we calculate wb's threshold
>> in global domain as you described above. This patch tries to fix calculation
>> of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb,
>> mdtc->bg_thresh)), means:
>> (wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*)
>> The cgroup domain threshold is
>> (memory of cgroup domain) / (memory of system) * (system threshold).
>> Then the wb's threshold in cgroup will be smaller than expected.
>>
>> Consider following domain hierarchy:
>> global domain (100G)
>> / \
>> cgroup domain1(50G) cgroup domain2(50G)
>> | |
>> bdi wb1 wb2
>> Assume wb1 and wb2 has the same bandwidth.
>> We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G.
>> Then we have:
>> wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth)
>> = 10G * 1/2 = 5G
>> wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth)
>> = 5G * 1/2 = 2.5G
>> At last, wb1 and wb2 will be limited at 2.5G, the system will be limited
>> at 5G which is less than global domain bg_thresh 10G.
>>
>> After the fix, threshold in cgroup domain will be:
>> (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold)
>> The wb1 and wb2 will be limited at 5G, the system will be limited at
>> 10G which equals to global domain bg_thresh 10G.
>>
>> As I didn't take a deep look into memory cgroup, please correct me if
>> anything is wrong. Thanks!
>>> Also, how is this tested? Was there a case where the existing code
>>> misbehaved that's improved by this patch? Or is this just from reading code?
>>
>> This is jut from reading code. Would the case showed above convince you
>> a bit. Look forward to your reply, thanks!.
>
> So, the explanation makes some sense to me but can you please construct a
> case to actually demonstrate the problem and fix? I don't think it'd be wise
> to apply the change without actually observing the code change does what it
> says it does.
Hi Tejun, sorry for the delay as I found there is a issue that keep triggering
writeback even the dirty page is under dirty background threshold. The issue
make it difficult to observe the expected improvment from this patch. I try to
fix it in [1] and test this patch based on the fix patches.
Run test as following:

/* make background writeback easier to observe */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
/* enable memory and io cgroup */
echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control

/* run fio in group1 with shell */
cd /sys/fs/cgroup
mkdir group1
cd group1
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

/* run another fio in group2 with another shell */
cd /sys/fs/cgroup
mkdir group2
cd group2
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdc
mount /dev/vdc /bdi2/
fio -name test -filename=/bdi2/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

Before the fix we got (result of three tests):
fio1
WRITE: bw=1304MiB/s (1367MB/s), 1304MiB/s-1304MiB/s (1367MB/s-1367MB/s), io=76.4GiB (82.0GB), run=60001-60001msec
WRITE: bw=1351MiB/s (1417MB/s), 1351MiB/s-1351MiB/s (1417MB/s-1417MB/s), io=79.2GiB (85.0GB), run=60001-60001msec
WRITE: bw=1373MiB/s (1440MB/s), 1373MiB/s-1373MiB/s (1440MB/s-1440MB/s), io=80.5GiB (86.4GB), run=60001-60001msec
fio2
WRITE: bw=1134MiB/s (1190MB/s), 1134MiB/s-1134MiB/s (1190MB/s-1190MB/s), io=66.5GiB (71.4GB), run=60001-60001msec
WRITE: bw=1414MiB/s (1483MB/s), 1414MiB/s-1414MiB/s (1483MB/s-1483MB/s), io=82.8GiB (88.0GB), run=60001-60001msec
WRITE: bw=1469MiB/s (1540MB/s), 1469MiB/s-1469MiB/s (1540MB/s-1540MB/s), io=86.0GiB (92.4GB), run=60001-60001msec

After the fix we got (result of three tests):
fio1
WRITE: bw=1719MiB/s (1802MB/s), 1719MiB/s-1719MiB/s (1802MB/s-1802MB/s), io=101GiB (108GB), run=60001-60001msec
WRITE: bw=1723MiB/s (1806MB/s), 1723MiB/s-1723MiB/s (1806MB/s-1806MB/s), io=101GiB (108GB), run=60001-60001msec
WRITE: bw=1691MiB/s (1774MB/s), 1691MiB/s-1691MiB/s (1774MB/s-1774MB/s), io=99.2GiB (106GB), run=60036-60036msec
fio2
WRITE: bw=1692MiB/s (1774MB/s), 1692MiB/s-1692MiB/s (1774MB/s-1774MB/s), io=99.1GiB (106GB), run=60001-60001msec
WRITE: bw=1681MiB/s (1763MB/s), 1681MiB/s-1681MiB/s (1763MB/s-1763MB/s), io=98.5GiB (106GB), run=60001-60001msec
WRITE: bw=1671MiB/s (1752MB/s), 1671MiB/s-1671MiB/s (1752MB/s-1752MB/s), io=97.9GiB (105GB), run=60001-60001msec
I also add code to print the pages written in writeback and pages written in
writeback reduce a lot and are rare after this fix.

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/T/#u
>
> Thanks.
>

2024-02-08 19:32:26

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

Hello, Kemeng.

On Thu, Feb 08, 2024 at 05:26:10PM +0800, Kemeng Shi wrote:
> Hi Tejun, sorry for the delay as I found there is a issue that keep triggering
> writeback even the dirty page is under dirty background threshold. The issue
> make it difficult to observe the expected improvment from this patch. I try to
> fix it in [1] and test this patch based on the fix patches.
> Run test as following:

Ah, that looks promising and thanks a lot for looking into this. It's great
to have someone actually poring over the code and behavior. Understanding
the wb and cgroup wb behaviors have always been challenging because the only
thing we have is the tracepoints and it's really tedious and difficult to
build an overall understanding from the trace outputs. Can I persuade you
into writing a drgn monitoring script similar to e.g.
tools/workqueues/wq_monitor.py? I think there's a pretty good chance the
visibility can be improved substantially.

Thanks.

--
tejun

2024-02-18 02:36:22

by Kemeng Shi

[permalink] [raw]

Subject: Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

on 2/9/2024 3:32 AM, Tejun Heo wrote:
> Hello, Kemeng.
>
> On Thu, Feb 08, 2024 at 05:26:10PM +0800, Kemeng Shi wrote:
>> Hi Tejun, sorry for the delay as I found there is a issue that keep triggering
>> writeback even the dirty page is under dirty background threshold. The issue
>> make it difficult to observe the expected improvment from this patch. I try to
>> fix it in [1] and test this patch based on the fix patches.
>> Run test as following:
>
> Ah, that looks promising and thanks a lot for looking into this. It's great
> to have someone actually poring over the code and behavior. Understanding
> the wb and cgroup wb behaviors have always been challenging because the only
> thing we have is the tracepoints and it's really tedious and difficult to
> build an overall understanding from the trace outputs. Can I persuade you
> into writing a drgn monitoring script similar to e.g.
> tools/workqueues/wq_monitor.py? I think there's a pretty good chance the
> visibility can be improved substantially.
Hi Tejun, sorry for the late reply as I was on vacation these days.
I agree that visibility is poor so I have to add some printks to debug.
Actually, I have added per wb stats to improve the visibility as we
only have per bdi stats (/sys/kernel/debug/bdi/xxx:x/stats) now. And I
plan to submit it in a new series.
I'd like to add a script to improve visibility more but I can't ensure
the time to do it. I would submit the monitoring script with the per wb
stats if the time problem does not bother you.
Thanks.
>
> Thanks.
>

2024-02-20 17:37:52

by Tejun Heo

[permalink] [raw]

Subject: Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

Hello,

On Sun, Feb 18, 2024 at 10:35:41AM +0800, Kemeng Shi wrote:
> Hi Tejun, sorry for the late reply as I was on vacation these days.
> I agree that visibility is poor so I have to add some printks to debug.
> Actually, I have added per wb stats to improve the visibility as we
> only have per bdi stats (/sys/kernel/debug/bdi/xxx:x/stats) now. And I
> plan to submit it in a new series.
> I'd like to add a script to improve visibility more but I can't ensure
> the time to do it. I would submit the monitoring script with the per wb
> stats if the time problem does not bother you.

It has had poor visiblity for many many years, I don't think we're in any
hurry.

Thanks.

--
tejun