LinuxLists.cc - [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

2016-10-18 07:20:28

Subject: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

From: z00281421 <[email protected]>

bdi flusher may enter page alloc slow path due to writepage and kmalloc.
in that case the flusher as a direct reclaimer should not be throttled here
because it can not to reclaim clean file pages or anaonymous pages
for next moment; furthermore writeback rate of dirty pages would be
slow down and other direct reclaimers and kswapd would be affected.
bdi flusher should be iosceduled by get_request rather than here.

Signed-off-by: z00281421 <[email protected]>
---
fs/fs-writeback.c | 4 ++--
include/linux/sched.h | 1 +
mm/vmscan.c | 15 +++++++++++----
3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05713a5..f6bf067 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1908,7 +1908,7 @@ void wb_workfn(struct work_struct *work)
long pages_written;

set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
- current->flags |= PF_SWAPWRITE;
+ current->flags |= (PF_SWAPWRITE | PF_BDI_FLUSHER | PF_LESS_THROTTLE);

if (likely(!current_is_workqueue_rescuer() ||
!test_bit(WB_registered, &wb->state))) {
@@ -1938,7 +1938,7 @@ void wb_workfn(struct work_struct *work)
else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
wb_wakeup_delayed(wb);

- current->flags &= ~PF_SWAPWRITE;
+ current->flags &= ~(PF_SWAPWRITE | PF_BDI_FLUSHER | PF_LESS_THROTTLE);
}

/*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..4bb70f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2232,6 +2232,7 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
+#define PF_BDI_FLUSHER 0x01000000 /* I am bdi flusher */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fe8b71..492e9e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1643,12 +1643,19 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
* If a kernel thread (such as nfsd for loop-back mounts) services
* a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
* In that case we should only throttle if the backing device it is
- * writing to is congested. In other cases it is safe to throttle.
+ * writing to is congested. another case is that bdi flusher could
+ * not be throttled here even though whose bdi is consgested.
+ * In other cases it is safe to throttle.
*/
-static int current_may_throttle(void)
+static bool current_may_throttle(void)
{
- return !(current->flags & PF_LESS_THROTTLE) ||
- current->backing_dev_info == NULL ||
+ if (!(current->flags & PF_LESS_THROTTLE))
+ return true;
+
+ if (current->flags & PF_BDI_FLUSHER)
+ return false;
+
+ return current->backing_dev_info == NULL ||
bdi_write_congested(current->backing_dev_info);
}

--
1.7.9.5

2016-10-18 09:35:21

by Hillf Danton

[permalink] [raw]

Subject: Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

> @@ -1908,7 +1908,7 @@ void wb_workfn(struct work_struct *work)
> long pages_written;
>
> set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> - current->flags |= PF_SWAPWRITE;

If flags carries PF_LESS_THROTTLE before modified, then you
have to restore it.

> + current->flags |= (PF_SWAPWRITE | PF_BDI_FLUSHER | PF_LESS_THROTTLE);
>
> if (likely(!current_is_workqueue_rescuer() ||
> !test_bit(WB_registered, &wb->state))) {
> @@ -1938,7 +1938,7 @@ void wb_workfn(struct work_struct *work)
> else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
> wb_wakeup_delayed(wb);
>
> - current->flags &= ~PF_SWAPWRITE;
> + current->flags &= ~(PF_SWAPWRITE | PF_BDI_FLUSHER | PF_LESS_THROTTLE);
> }
>
thanks
Hillf

2016-10-18 09:59:23

by Mel Gorman

[permalink] [raw]

Subject: Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

On Tue, Oct 18, 2016 at 03:12:45PM +0800, [email protected] wrote:
> From: z00281421 <[email protected]>
>
> bdi flusher may enter page alloc slow path due to writepage and kmalloc.
> in that case the flusher as a direct reclaimer should not be throttled here
> because it can not to reclaim clean file pages or anaonymous pages
> for next moment; furthermore writeback rate of dirty pages would be
> slow down and other direct reclaimers and kswapd would be affected.
> bdi flusher should be iosceduled by get_request rather than here.
>
> Signed-off-by: z00281421 <[email protected]>

What does this patch do that PF_LESS_THROTTLE is not doing already if
there is an underlying BDI?

There have been a few patches like this recently that look like they might
do something useful but are subtle. They really should be accompanied by
a test case and data showing they either fix a functional issue (machine
livelocking due to writeback not making progress) or a performance issue.

--
Mel Gorman
SUSE Labs

2016-10-18 11:11:25

by zhouxianrong

[permalink] [raw]

Subject: Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

Call trace:
[<ffffffc0000863dc>] __switch_to+0x80/0x98
[<ffffffc001160c58>] __schedule+0x314/0x854
[<ffffffc0011611e0>] schedule+0x48/0xa4
[<ffffffc0011648c4>] schedule_timeout+0x158/0x2c8
[<ffffffc0011608b4>] io_schedule_timeout+0xbc/0x14c
[<ffffffc0001aec84>] wait_iff_congested+0x1d4/0x1ec
[<ffffffc0001a36b0>] shrink_inactive_list+0x530/0x760
[<ffffffc0001a3e14>] shrink_lruvec+0x534/0x76c
[<ffffffc0001a40d4>] shrink_zone+0x88/0x1b8
[<ffffffc0001a4444>] do_try_to_free_pages+0x240/0x478
[<ffffffc0001a4788>] try_to_free_pages+0x10c/0x284
[<ffffffc0001968a4>] __alloc_pages_nodemask+0x540/0x918
[<ffffffc0001dd0e8>] new_slab+0x334/0x4a0
[<ffffffc0001df37c>] __slab_alloc.isra.75.constprop.77+0x6bc/0x780
[<ffffffc0001df584>] kmem_cache_alloc+0x144/0x23c
[<ffffffc00018f040>] mempool_alloc_slab+0x2c/0x38
[<ffffffc00018f1f4>] mempool_alloc+0x7c/0x188
[<ffffffc0003f462c>] bio_alloc_bioset+0x1cc/0x254
[<ffffffc00022a430>] _submit_bh+0x74/0x1c8
[<ffffffc00022c9d0>] __block_write_full_page.constprop.33+0x1a0/0x40c
[<ffffffc00022cd1c>] block_write_full_page+0xe0/0x134
[<ffffffc00022da64>] blkdev_writepage+0x30/0x3c
[<ffffffc000197d08>] __writepage+0x34/0x74
[<ffffffc000198880>] write_cache_pages+0x1e8/0x450
[<ffffffc000198b3c>] generic_writepages+0x54/0x8c
[<ffffffc00019a990>] do_writepages+0x40/0x6c
[<ffffffc00021e604>] __writeback_single_inode+0x60/0x51c
[<ffffffc00021eeec>] writeback_sb_inodes+0x2d4/0x46c
[<ffffffc00021f128>] __writeback_inodes_wb+0xa4/0xe8
[<ffffffc00021f480>] wb_writeback+0x314/0x3fc
[<ffffffc000220224>] bdi_writeback_workfn+0x130/0x4e0
[<ffffffc0000be4d4>] process_one_work+0x18c/0x51c
[<ffffffc0000bedd8>] worker_thread+0x15c/0x51c
[<ffffffc0000c5718>] kthread+0x10c/0x120

the above calltrace occured when write sdcard under large and long pressure.
the patch is a performance issue. i hope flusher do not be throttled just here and
let it reclaim the successive clean file pages or anonymous pages on lru list
and then return to write left dirty pages of inode. it would speed up write-back
speed of dirty pages. so other direct reclaimers can reclaim more clean pages.
in low memory caused by big pagecache bdi writeback speed play a key role.

On 2016/10/18 17:59, Mel Gorman wrote:
> On Tue, Oct 18, 2016 at 03:12:45PM +0800, [email protected] wrote:
>> From: z00281421 <[email protected]>
>>
>> bdi flusher may enter page alloc slow path due to writepage and kmalloc.
>> in that case the flusher as a direct reclaimer should not be throttled here
>> because it can not to reclaim clean file pages or anaonymous pages
>> for next moment; furthermore writeback rate of dirty pages would be
>> slow down and other direct reclaimers and kswapd would be affected.
>> bdi flusher should be iosceduled by get_request rather than here.
>>
>> Signed-off-by: z00281421 <[email protected]>
>
> What does this patch do that PF_LESS_THROTTLE is not doing already if
> there is an underlying BDI?
>
> There have been a few patches like this recently that look like they might
> do something useful but are subtle. They really should be accompanied by
> a test case and data showing they either fix a functional issue (machine
> livelocking due to writeback not making progress) or a performance issue.
>

2016-10-18 11:42:19

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

On Tue 18-10-16 19:08:05, zhouxianrong wrote:
> Call trace:
> [<ffffffc0000863dc>] __switch_to+0x80/0x98
> [<ffffffc001160c58>] __schedule+0x314/0x854
> [<ffffffc0011611e0>] schedule+0x48/0xa4
> [<ffffffc0011648c4>] schedule_timeout+0x158/0x2c8
> [<ffffffc0011608b4>] io_schedule_timeout+0xbc/0x14c
> [<ffffffc0001aec84>] wait_iff_congested+0x1d4/0x1ec
> [<ffffffc0001a36b0>] shrink_inactive_list+0x530/0x760
> [<ffffffc0001a3e14>] shrink_lruvec+0x534/0x76c
> [<ffffffc0001a40d4>] shrink_zone+0x88/0x1b8
> [<ffffffc0001a4444>] do_try_to_free_pages+0x240/0x478
> [<ffffffc0001a4788>] try_to_free_pages+0x10c/0x284
> [<ffffffc0001968a4>] __alloc_pages_nodemask+0x540/0x918
> [<ffffffc0001dd0e8>] new_slab+0x334/0x4a0
> [<ffffffc0001df37c>] __slab_alloc.isra.75.constprop.77+0x6bc/0x780
> [<ffffffc0001df584>] kmem_cache_alloc+0x144/0x23c
> [<ffffffc00018f040>] mempool_alloc_slab+0x2c/0x38
> [<ffffffc00018f1f4>] mempool_alloc+0x7c/0x188
> [<ffffffc0003f462c>] bio_alloc_bioset+0x1cc/0x254
> [<ffffffc00022a430>] _submit_bh+0x74/0x1c8
> [<ffffffc00022c9d0>] __block_write_full_page.constprop.33+0x1a0/0x40c
> [<ffffffc00022cd1c>] block_write_full_page+0xe0/0x134
> [<ffffffc00022da64>] blkdev_writepage+0x30/0x3c
> [<ffffffc000197d08>] __writepage+0x34/0x74
> [<ffffffc000198880>] write_cache_pages+0x1e8/0x450
> [<ffffffc000198b3c>] generic_writepages+0x54/0x8c
> [<ffffffc00019a990>] do_writepages+0x40/0x6c
> [<ffffffc00021e604>] __writeback_single_inode+0x60/0x51c
> [<ffffffc00021eeec>] writeback_sb_inodes+0x2d4/0x46c
> [<ffffffc00021f128>] __writeback_inodes_wb+0xa4/0xe8
> [<ffffffc00021f480>] wb_writeback+0x314/0x3fc
> [<ffffffc000220224>] bdi_writeback_workfn+0x130/0x4e0
> [<ffffffc0000be4d4>] process_one_work+0x18c/0x51c
> [<ffffffc0000bedd8>] worker_thread+0x15c/0x51c
> [<ffffffc0000c5718>] kthread+0x10c/0x120
>
> the above calltrace occured when write sdcard under large and long pressure.
> the patch is a performance issue. i hope flusher do not be throttled just here and
> let it reclaim the successive clean file pages or anonymous pages on lru list
> and then return to write left dirty pages of inode. it would speed up write-back
> speed of dirty pages. so other direct reclaimers can reclaim more clean pages.
> in low memory caused by big pagecache bdi writeback speed play a key role.

If we got here then we are hitting into dirty/writeback pages on the
tail of the LRU list and the bdi is congested. So there are no clean
pages most probably and the storage doesn't catch up with that IO.

Why do you think that not throttling would help here? Do you really see
that the further reclaim really makes forward progress or it just wastes
more CPU without doing a useful work?

In other words much more information please!
--
Michal Hocko
SUSE Labs

2016-10-20 12:45:41

by zhouxianrong

[permalink] [raw]

Subject: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

From: z00281421 <[email protected]>

The bdi flusher should be throttled only depends on
own bdi and is decoupled with others.

separate PGDAT_WRITEBACK into PGDAT_ANON_WRITEBACK and
PGDAT_FILE_WRITEBACK avoid scanning anon lru and it is ok
then throttled on file WRITEBACK.

i think above may be not right.

Signed-off-by: z00281421 <[email protected]>
---
fs/fs-writeback.c | 8 ++++++--
include/linux/mmzone.h | 7 +++++--
mm/vmscan.c | 20 ++++++++++++--------
3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05713a5..ddcc70f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1905,10 +1905,13 @@ void wb_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(to_delayed_work(work),
struct bdi_writeback, dwork);
+ struct backing_dev_info *bdi = container_of(to_delayed_work(work),
+ struct backing_dev_info, wb.dwork);
long pages_written;

set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
- current->flags |= PF_SWAPWRITE;
+ current->flags |= (PF_SWAPWRITE | PF_LESS_THROTTLE);
+ current->bdi = bdi;

if (likely(!current_is_workqueue_rescuer() ||
!test_bit(WB_registered, &wb->state))) {
@@ -1938,7 +1941,8 @@ void wb_workfn(struct work_struct *work)
else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
wb_wakeup_delayed(wb);

- current->flags &= ~PF_SWAPWRITE;
+ current->bdi = NULL;
+ current->flags &= ~(PF_SWAPWRITE | PF_LESS_THROTTLE);
}

/*
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7f2ae99..fa602e9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -528,8 +528,11 @@ enum pgdat_flags {
* many dirty file pages at the tail
* of the LRU.
*/
- PGDAT_WRITEBACK, /* reclaim scanning has recently found
- * many pages under writeback
+ PGDAT_ANON_WRITEBACK, /* reclaim scanning has recently found
+ * many anonymous pages under writeback
+ */
+ PGDAT_FILE_WRITEBACK, /* reclaim scanning has recently found
+ * many file pages under writeback
*/
PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */
};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0fe8b71..3f08ba3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -917,6 +917,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unsigned long nr_reclaimed = 0;
unsigned long nr_writeback = 0;
unsigned long nr_immediate = 0;
+ int file;

cond_resched();

@@ -954,6 +955,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

+ file = page_is_file_cache(page)
+
/*
* The number of dirty pages determines if a zone is marked
* reclaim_congested which affects wait_iff_congested. kswapd
@@ -1016,7 +1019,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
/* Case 1 above */
if (current_is_kswapd() &&
PageReclaim(page) &&
- test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
+ test_bit(PGDAT_ANON_WRITEBACK + file, &pgdat->flags)) {
nr_immediate++;
goto keep_locked;

@@ -1643,13 +1646,14 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
* If a kernel thread (such as nfsd for loop-back mounts) services
* a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
* In that case we should only throttle if the backing device it is
- * writing to is congested. In other cases it is safe to throttle.
+ * writing to is congested. The bdi flusher should be throttled only depends
+ * on own bdi and is decoupled with others. In other cases it is safe to throttle.
*/
-static int current_may_throttle(void)
+static int current_may_throttle(int file)
{
return !(current->flags & PF_LESS_THROTTLE) ||
- current->backing_dev_info == NULL ||
- bdi_write_congested(current->backing_dev_info);
+ (file && (current->backing_dev_info == NULL ||
+ bdi_write_congested(current->backing_dev_info)));
}

static bool inactive_reclaimable_pages(struct lruvec *lruvec,
@@ -1774,7 +1778,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* are encountered in the nr_immediate check below.
*/
if (nr_writeback && nr_writeback == nr_taken)
- set_bit(PGDAT_WRITEBACK, &pgdat->flags);
+ set_bit(PGDAT_ANON_WRITEBACK + file, &pgdat->flags);

/*
* Legacy memcg will stall in page writeback so avoid forcibly
@@ -1803,7 +1807,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* that pages are cycling through the LRU faster than
* they are written so also forcibly stall.
*/
- if (nr_immediate && current_may_throttle())
+ if (nr_immediate && current_may_throttle(file))
congestion_wait(BLK_RW_ASYNC, HZ/10);
}

@@ -1813,7 +1817,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* unqueued dirty pages or cycling through the LRU too quickly.
*/
if (!sc->hibernation_mode && !current_is_kswapd() &&
- current_may_throttle())
+ current_may_throttle(file))
wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);

trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
--
1.7.9.5

2016-10-20 13:28:50

by Michal Hocko

[permalink] [raw]

Subject: Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

On Thu 20-10-16 20:38:05, [email protected] wrote:
> From: z00281421 <[email protected]>
>
> The bdi flusher should be throttled only depends on
> own bdi and is decoupled with others.
>
> separate PGDAT_WRITEBACK into PGDAT_ANON_WRITEBACK and
> PGDAT_FILE_WRITEBACK avoid scanning anon lru and it is ok
> then throttled on file WRITEBACK.

Could you please answer questions from
http://lkml.kernel.org/r/[email protected] before
coming up with new and even more complex patches please?

I would really like to understand the issue you are seeing before
jumping into patches...

Thanks!
--
Michal Hocko
SUSE Labs

2016-10-20 15:39:16

by Mika Penttilä

[permalink] [raw]

Subject: Re: [PATCH] bdi flusher should not be throttled here when it fall into buddy slow path

On 20.10.2016 15:38, [email protected] wrote:
> From: z00281421 <[email protected]>
>
> The bdi flusher should be throttled only depends on
> own bdi and is decoupled with others.
>
> separate PGDAT_WRITEBACK into PGDAT_ANON_WRITEBACK and
> PGDAT_FILE_WRITEBACK avoid scanning anon lru and it is ok
> then throttled on file WRITEBACK.
>
> i think above may be not right.
>
> Signed-off-by: z00281421 <[email protected]>
> ---
> fs/fs-writeback.c | 8 ++++++--
> include/linux/mmzone.h | 7 +++++--
> mm/vmscan.c | 20 ++++++++++++--------
> 3 files changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 05713a5..ddcc70f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1905,10 +1905,13 @@ void wb_workfn(struct work_struct *work)
> {
> struct bdi_writeback *wb = container_of(to_delayed_work(work),
> struct bdi_writeback, dwork);
> + struct backing_dev_info *bdi = container_of(to_delayed_work(work),
> + struct backing_dev_info, wb.dwork);
> long pages_written;
>
> set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> - current->flags |= PF_SWAPWRITE;
> + current->flags |= (PF_SWAPWRITE | PF_LESS_THROTTLE);
> + current->bdi = bdi;
>
> if (likely(!current_is_workqueue_rescuer() ||
> !test_bit(WB_registered, &wb->state))) {
> @@ -1938,7 +1941,8 @@ void wb_workfn(struct work_struct *work)
> else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
> wb_wakeup_delayed(wb);
>
> - current->flags &= ~PF_SWAPWRITE;
> + current->bdi = NULL;
> + current->flags &= ~(PF_SWAPWRITE | PF_LESS_THROTTLE);
> }
>
> /*
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7f2ae99..fa602e9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -528,8 +528,11 @@ enum pgdat_flags {
> * many dirty file pages at the tail
> * of the LRU.
> */
> - PGDAT_WRITEBACK, /* reclaim scanning has recently found
> - * many pages under writeback
> + PGDAT_ANON_WRITEBACK, /* reclaim scanning has recently found
> + * many anonymous pages under writeback
> + */
> + PGDAT_FILE_WRITEBACK, /* reclaim scanning has recently found
> + * many file pages under writeback
> */
> PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */

Nobody seems to be clearing those bits (same was with PGDAT_WRITEBACK) ?

--Mika