2017-09-19 19:53:15

by Jens Axboe

Subject: [PATCH 0/6] More graceful flusher thread memory reclaim wakeup

We've had some issues with writeback in the presence of memory
reclaim at Facebook, and this patch set attempts to fix them up.
The real functional change is the last patch in the series; the
first 5 are prep and cleanup patches.

The basic idea is that we have callers that call
wakeup_flusher_threads() with nr_pages == 0, which means 'write
back everything'. In memory reclaim situations, we can end up
queuing a TON of these kinds of writeback units. This can cause
softlockups and further memory issues, since we allocate huge
numbers of struct wb_writeback_work to handle this writeback.
Handle this situation more gracefully.

--
Jens Axboe


2017-09-19 19:53:20

by Jens Axboe

Subject: [PATCH 1/6] buffer: cleanup free_more_memory() flusher wakeup

This whole function is... interesting. Change the wakeup call
to the flusher threads to pass in nr_pages == 0, instead of
some random number of pages. This matches more closely what
similar cases do for memory shortage/reclaim.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 170df856bdb9..9471a445e370 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -260,7 +260,7 @@ static void free_more_memory(void)
struct zoneref *z;
int nid;

- wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
+ wakeup_flusher_threads(0, WB_REASON_FREE_MORE_MEM);
yield();

for_each_online_node(nid) {
--
2.7.4

2017-09-19 19:53:24

by Jens Axboe

Subject: [PATCH 4/6] fs-writeback: make wb_start_writeback() static

We don't have any callers outside of fs-writeback.c anymore,
so make it private.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 4 ++--
include/linux/backing-dev.h | 2 --
2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 03fda0830bf8..7564347914f8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -933,8 +933,8 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,

#endif /* CONFIG_CGROUP_WRITEBACK */

-void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
- bool range_cyclic, enum wb_reason reason)
+static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+ bool range_cyclic, enum wb_reason reason)
{
struct wb_writeback_work *work;

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..157e950a70dc 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -38,8 +38,6 @@ static inline struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
return bdi_alloc_node(gfp_mask, NUMA_NO_NODE);
}

-void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
- bool range_cyclic, enum wb_reason reason);
void wb_start_background_writeback(struct bdi_writeback *wb);
void wb_workfn(struct work_struct *work);
void wb_wakeup_delayed(struct bdi_writeback *wb);
--
2.7.4

2017-09-19 19:53:28

by Jens Axboe

Subject: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

A few callers pass in nr_pages == 0 when they wake up the flusher
threads, which means that the flusher should just flush everything
that is currently dirty. If we are tight on memory, we can get
tons of these queued from kswapd/vmscan. This causes (at least)
two problems:

1) We consume a ton of memory just allocating writeback work items.
2) We spend so much time processing these work items, that we
introduce a softlockup in writeback processing.

Fix this by adding a 'zero_pages' bit to the writeback structure,
and set that when someone queues a nr_pages==0 flusher thread
wakeup. The bit is cleared when we start writeback on that work
item. If the bit is already set when we attempt to queue !nr_pages
writeback, then we simply ignore it.

This limits us to one full flush in flight, with at most one more
pending, and makes for more efficient handling of this type of
writeback.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 30 ++++++++++++++++++++++++++++--
include/linux/backing-dev-defs.h | 1 +
2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a9a86644cb9f..e0240110b36f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,6 +53,7 @@ struct wb_writeback_work {
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned int auto_free:1; /* free on completion */
+ unsigned int zero_pages:1; /* nr_pages == 0 writeback */
enum wb_reason reason; /* why was writeback initiated? */

struct list_head list; /* pending work list */
@@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason)
{
struct wb_writeback_work *work;
+ bool zero_pages = false;

if (!wb_has_dirty_io(wb))
return;

/*
- * If someone asked for zero pages, we write out the WORLD
+ * If someone asked for zero pages, we write out the WORLD.
+ * Places like vmscan and laptop mode want to queue a wakeup to
+ * the flusher threads to clean out everything. To avoid potentially
+ * having tons of these pending, ensure that we only allow one of
+ * them pending and inflight at the time
*/
- if (!nr_pages)
+ if (!nr_pages) {
+ if (test_bit(WB_zero_pages, &wb->state))
+ return;
+ set_bit(WB_zero_pages, &wb->state);
nr_pages = get_nr_dirty_pages();
+ zero_pages = true;
+ }

/*
* This is WB_SYNC_NONE writeback, so if allocation fails just
@@ -975,6 +986,7 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
work->range_cyclic = range_cyclic;
work->reason = reason;
work->auto_free = 1;
+ work->zero_pages = zero_pages;

wb_queue_work(wb, work);
}
@@ -1828,6 +1840,14 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
list_del_init(&work->list);
}
spin_unlock_bh(&wb->work_lock);
+
+ /*
+ * Once we start processing a work item that had !nr_pages,
+ * clear the wb state bit for that so we can allow more.
+ */
+ if (work && work->zero_pages && test_bit(WB_zero_pages, &wb->state))
+ clear_bit(WB_zero_pages, &wb->state);
+
return work;
}

@@ -1896,6 +1916,12 @@ static long wb_do_writeback(struct bdi_writeback *wb)
trace_writeback_exec(wb, work);
wrote += wb_writeback(wb, work);
finish_writeback_work(wb, work);
+
+ /*
+ * If we have a lot of pending work, make sure we take
+ * an occasional breather, if needed.
+ */
+ cond_resched();
}

/*
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..7494f6a75458 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -24,6 +24,7 @@ enum wb_state {
WB_shutting_down, /* wb_shutdown() in progress */
WB_writeback_running, /* Writeback is in progress */
WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
+ WB_zero_pages, /* nr_pages == 0 flush pending */
};

enum wb_congested_state {
--
2.7.4

2017-09-19 19:53:53

by Jens Axboe

Subject: [PATCH 5/6] fs-writeback: move nr_pages == 0 logic to one location

Now that we have no external callers of wb_start_writeback(),
we can move the nr_pages == 0 logic into that function.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7564347914f8..a9a86644cb9f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -933,6 +933,17 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,

#endif /* CONFIG_CGROUP_WRITEBACK */

+/*
+ * Add in the number of potentially dirty inodes, because each inode
+ * write can dirty pagecache in the underlying blockdev.
+ */
+static unsigned long get_nr_dirty_pages(void)
+{
+ return global_node_page_state(NR_FILE_DIRTY) +
+ global_node_page_state(NR_UNSTABLE_NFS) +
+ get_nr_dirty_inodes();
+}
+
static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason)
{
@@ -942,6 +953,12 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
return;

/*
+ * If someone asked for zero pages, we write out the WORLD
+ */
+ if (!nr_pages)
+ nr_pages = get_nr_dirty_pages();
+
+ /*
* This is WB_SYNC_NONE writeback, so if allocation fails just
* wakeup the thread for old dirty data writeback
*/
@@ -1814,17 +1831,6 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
return work;
}

-/*
- * Add in the number of potentially dirty inodes, because each inode
- * write can dirty pagecache in the underlying blockdev.
- */
-static unsigned long get_nr_dirty_pages(void)
-{
- return global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS) +
- get_nr_dirty_inodes();
-}
-
static long wb_check_background_flush(struct bdi_writeback *wb)
{
if (wb_over_bg_thresh(wb)) {
@@ -1966,9 +1972,6 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason)
{
- if (!nr_pages)
- nr_pages = get_nr_dirty_pages();
-
rcu_read_lock();
__wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
rcu_read_unlock();
@@ -1988,9 +1991,6 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
if (blk_needs_flush_plug(current))
blk_schedule_flush_plug(current);

- if (!nr_pages)
- nr_pages = get_nr_dirty_pages();
-
rcu_read_lock();
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
__wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
--
2.7.4

2017-09-19 19:54:16

by Jens Axboe

Subject: [PATCH 3/6] page-writeback: pass in '0' for nr_pages writeback in laptop mode

Laptop mode really wants to write back all of the currently
dirty pages and inodes. Instead of calculating this count in the
caller, just pass in 0 and let wakeup_flusher_threads() handle it.

Use the new wakeup_flusher_threads_bdi() instead of rolling
our own.

Signed-off-by: Jens Axboe <[email protected]>
---
mm/page-writeback.c | 18 ++----------------
1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b9c5cbe8eba..1933778c52c4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1980,23 +1980,9 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
void laptop_mode_timer_fn(unsigned long data)
{
struct request_queue *q = (struct request_queue *)data;
- int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS);
- struct bdi_writeback *wb;

- /*
- * We want to write everything out, not just down to the dirty
- * threshold
- */
- if (!bdi_has_dirty_io(q->backing_dev_info))
- return;
-
- rcu_read_lock();
- list_for_each_entry_rcu(wb, &q->backing_dev_info->wb_list, bdi_node)
- if (wb_has_dirty_io(wb))
- wb_start_writeback(wb, nr_pages, true,
- WB_REASON_LAPTOP_TIMER);
- rcu_read_unlock();
+ wakeup_flusher_threads_bdi(q->backing_dev_info, 0,
+ WB_REASON_LAPTOP_TIMER);
}

/*
--
2.7.4

2017-09-19 19:54:37

by Jens Axboe

Subject: [PATCH 2/6] fs-writeback: provide a wakeup_flusher_threads_bdi()

Similar to wakeup_flusher_threads(), except that we only wake
up the flusher threads on the specified backing device.

No functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 40 ++++++++++++++++++++++++++++++----------
include/linux/writeback.h | 2 ++
2 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 245c430a2e41..03fda0830bf8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1947,6 +1947,34 @@ void wb_workfn(struct work_struct *work)
}

/*
+ * Start writeback of `nr_pages' pages on this bdi. If `nr_pages' is zero,
+ * write back the whole world.
+ */
+static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
+ long nr_pages, enum wb_reason reason)
+{
+ struct bdi_writeback *wb;
+
+ if (!bdi_has_dirty_io(bdi))
+ return;
+
+ list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
+ wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
+ false, reason);
+}
+
+void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, long nr_pages,
+ enum wb_reason reason)
+{
+ if (!nr_pages)
+ nr_pages = get_nr_dirty_pages();
+
+ rcu_read_lock();
+ __wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
+ rcu_read_unlock();
+}
+
+/*
* Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
* the whole world.
*/
@@ -1964,16 +1992,8 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
nr_pages = get_nr_dirty_pages();

rcu_read_lock();
- list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
- struct bdi_writeback *wb;
-
- if (!bdi_has_dirty_io(bdi))
- continue;
-
- list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
- wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
- false, reason);
- }
+ list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
+ __wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
rcu_read_unlock();
}

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d5815794416c..5a7ed74d1f6f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -190,6 +190,8 @@ bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
void sync_inodes_sb(struct super_block *);
void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, long nr_pages,
+ enum wb_reason reason);
void inode_wait_for_writeback(struct inode *inode);

/* writeback.h requires fs.h; it, too, is not included from here. */
--
2.7.4

2017-09-19 20:05:20

by Johannes Weiner

Subject: Re: [PATCH 1/6] buffer: cleanup free_more_memory() flusher wakeup

On Tue, Sep 19, 2017 at 01:53:02PM -0600, Jens Axboe wrote:
> This whole function is... interesting. Change the wakeup call
> to the flusher threads to pass in nr_pages == 0, instead of
> some random number of pages. This matches more closely what
> similar cases do for memory shortage/reclaim.
>
> Signed-off-by: Jens Axboe <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2017-09-19 20:05:49

by Johannes Weiner

Subject: Re: [PATCH 2/6] fs-writeback: provide a wakeup_flusher_threads_bdi()

On Tue, Sep 19, 2017 at 01:53:03PM -0600, Jens Axboe wrote:
> Similar to wakeup_flusher_threads(), except that we only wake
> up the flusher threads on the specified backing device.
>
> No functional changes in this patch.
>
> Signed-off-by: Jens Axboe <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2017-09-19 20:06:44

by Johannes Weiner

Subject: Re: [PATCH 3/6] page-writeback: pass in '0' for nr_pages writeback in laptop mode

On Tue, Sep 19, 2017 at 01:53:04PM -0600, Jens Axboe wrote:
> Laptop mode really wants to writeback the number of dirty
> pages and inodes. Instead of calculating this in the caller,
> just pass in 0 and let wakeup_flusher_threads() handle it.
>
> Use the new wakeup_flusher_threads_bdi() instead of rolling
> our own.
>
> Signed-off-by: Jens Axboe <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2017-09-19 20:07:33

by Johannes Weiner

Subject: Re: [PATCH 4/6] fs-writeback: make wb_start_writeback() static

On Tue, Sep 19, 2017 at 01:53:05PM -0600, Jens Axboe wrote:
> We don't have any callers outside of fs-writeback.c anymore,
> make it private.
>
> Signed-off-by: Jens Axboe <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2017-09-19 20:08:07

by Johannes Weiner

Subject: Re: [PATCH 5/6] fs-writeback: move nr_pages == 0 logic to one location

On Tue, Sep 19, 2017 at 01:53:06PM -0600, Jens Axboe wrote:
> Now that we have no external callers of wb_start_writeback(),
> we can move the nr_pages == 0 logic into that function.
>
> Signed-off-by: Jens Axboe <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2017-09-19 20:18:47

by Johannes Weiner

Subject: Re: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

On Tue, Sep 19, 2017 at 01:53:07PM -0600, Jens Axboe wrote:
> A few callers pass in nr_pages == 0 when they wakeup the flusher
> threads, which means that the flusher should just flush everything
> that was currently dirty. If we are tight on memory, we can get
> tons of these queued from kswapd/vmscan. This causes (at least)
> two problems:
>
> 1) We consume a ton of memory just allocating writeback work items.
> 2) We spend so much time processing these work items, that we
> introduce a softlockup in writeback processing.
>
> Fix this by adding a 'zero_pages' bit to the writeback structure,
> and set that when someone queues a nr_pages==0 flusher thread
> wakeup. The bit is cleared when we start writeback on that work
> item. If the bit is already set when we attempt to queue !nr_pages
> writeback, then we simply ignore it.
>
> This provides us one of full flush in flight, with one pending as
> well, and makes for more efficient handling of this type of
> writeback.
>
> Signed-off-by: Jens Axboe <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

Just a nitpick:

> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> bool range_cyclic, enum wb_reason reason)
> {
> struct wb_writeback_work *work;
> + bool zero_pages = false;
>
> if (!wb_has_dirty_io(wb))
> return;
>
> /*
> - * If someone asked for zero pages, we write out the WORLD
> + * If someone asked for zero pages, we write out the WORLD.
> + * Places like vmscan and laptop mode want to queue a wakeup to
> + * the flusher threads to clean out everything. To avoid potentially
> + * having tons of these pending, ensure that we only allow one of
> + * them pending and inflight at the time
> */
> - if (!nr_pages)
> + if (!nr_pages) {
> + if (test_bit(WB_zero_pages, &wb->state))
> + return;
> + set_bit(WB_zero_pages, &wb->state);
> nr_pages = get_nr_dirty_pages();

We could rely on work->older_than_this and pass LONG_MAX here
instead to write out the world as it was at the time wb commences.

get_nr_dirty_pages() is somewhat clearer on intent, but on the other
hand it returns global state and is used here in a split-bdi context,
so in sum we can end up requesting the system-wide dirty pages
several times over. It'll work fine, relying on work->older_than_this
to contain it as well; it just seems a little ugly and subtle.

2017-09-19 20:39:38

by Jens Axboe

Subject: Re: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

On 09/19/2017 02:18 PM, Johannes Weiner wrote:
> On Tue, Sep 19, 2017 at 01:53:07PM -0600, Jens Axboe wrote:
>> A few callers pass in nr_pages == 0 when they wakeup the flusher
>> threads, which means that the flusher should just flush everything
>> that was currently dirty. If we are tight on memory, we can get
>> tons of these queued from kswapd/vmscan. This causes (at least)
>> two problems:
>>
>> 1) We consume a ton of memory just allocating writeback work items.
>> 2) We spend so much time processing these work items, that we
>> introduce a softlockup in writeback processing.
>>
>> Fix this by adding a 'zero_pages' bit to the writeback structure,
>> and set that when someone queues a nr_pages==0 flusher thread
>> wakeup. The bit is cleared when we start writeback on that work
>> item. If the bit is already set when we attempt to queue !nr_pages
>> writeback, then we simply ignore it.
>>
>> This provides us one of full flush in flight, with one pending as
>> well, and makes for more efficient handling of this type of
>> writeback.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>
>
> Just a nitpick:
>
>> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>> bool range_cyclic, enum wb_reason reason)
>> {
>> struct wb_writeback_work *work;
>> + bool zero_pages = false;
>>
>> if (!wb_has_dirty_io(wb))
>> return;
>>
>> /*
>> - * If someone asked for zero pages, we write out the WORLD
>> + * If someone asked for zero pages, we write out the WORLD.
>> + * Places like vmscan and laptop mode want to queue a wakeup to
>> + * the flusher threads to clean out everything. To avoid potentially
>> + * having tons of these pending, ensure that we only allow one of
>> + * them pending and inflight at the time
>> */
>> - if (!nr_pages)
>> + if (!nr_pages) {
>> + if (test_bit(WB_zero_pages, &wb->state))
>> + return;
>> + set_bit(WB_zero_pages, &wb->state);
>> nr_pages = get_nr_dirty_pages();
>
> We could rely on the work->older_than_this and pass LONG_MAX here
> instead to write out the world as it was at the time wb commences.
>
> get_nr_dirty_pages() is somewhat clearer on intent, but on the other
> hand it returns global state and is used here in a split-bdi context,
> and we can end up in sum requesting the system-wide dirty pages
> several times over. It'll work fine, relying on work->older_than_this
> to contain it also, it just seems a little ugly and subtle.

Not disagreeing with that at all. I just carried the !nr_pages forward
as the way to do this. I think any further cleanup or work should just
be based on this patchset; I'd definitely welcome a change in that
direction.

Thanks for your reviews!

--
Jens Axboe

2017-09-20 01:57:16

by Jens Axboe

Subject: Re: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

On 09/19/2017 01:53 PM, Jens Axboe wrote:
> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> bool range_cyclic, enum wb_reason reason)
> {
> struct wb_writeback_work *work;
> + bool zero_pages = false;
>
> if (!wb_has_dirty_io(wb))
> return;
>
> /*
> - * If someone asked for zero pages, we write out the WORLD
> + * If someone asked for zero pages, we write out the WORLD.
> + * Places like vmscan and laptop mode want to queue a wakeup to
> + * the flusher threads to clean out everything. To avoid potentially
> + * having tons of these pending, ensure that we only allow one of
> + * them pending and inflight at the time
> */
> - if (!nr_pages)
> + if (!nr_pages) {
> + if (test_bit(WB_zero_pages, &wb->state))
> + return;
> + set_bit(WB_zero_pages, &wb->state);
> nr_pages = get_nr_dirty_pages();
> + zero_pages = true;
> + }

Later fix added here to ensure we clear WB_zero_pages if work
allocation fails:

work = kzalloc(sizeof(*work),
GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (!work) {
if (zero_pages)
clear_bit(WB_zero_pages, &wb->state);
[...]

Updated patch here:

http://git.kernel.dk/cgit/linux-block/commit/?h=writeback-fixup&id=21ea70657894fda9fccf257543cbec112b2813ef

--
Jens Axboe

2017-09-20 03:10:17

by Amir Goldstein

Subject: Re: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

On Tue, Sep 19, 2017 at 10:53 PM, Jens Axboe <[email protected]> wrote:
> A few callers pass in nr_pages == 0 when they wakeup the flusher
> threads, which means that the flusher should just flush everything
> that was currently dirty. If we are tight on memory, we can get
> tons of these queued from kswapd/vmscan. This causes (at least)
> two problems:
>
> 1) We consume a ton of memory just allocating writeback work items.
> 2) We spend so much time processing these work items, that we
> introduce a softlockup in writeback processing.
>
> Fix this by adding a 'zero_pages' bit to the writeback structure,
> and set that when someone queues a nr_pages==0 flusher thread
> wakeup. The bit is cleared when we start writeback on that work
> item. If the bit is already set when we attempt to queue !nr_pages
> writeback, then we simply ignore it.
>
> This provides us one of full flush in flight, with one pending as
> well, and makes for more efficient handling of this type of
> writeback.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/fs-writeback.c | 30 ++++++++++++++++++++++++++++--
> include/linux/backing-dev-defs.h | 1 +
> 2 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a9a86644cb9f..e0240110b36f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -53,6 +53,7 @@ struct wb_writeback_work {
> unsigned int for_background:1;
> unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
> unsigned int auto_free:1; /* free on completion */
> + unsigned int zero_pages:1; /* nr_pages == 0 writeback */

Suggest: use a name that describes the intention (e.g. WB_everything)

> enum wb_reason reason; /* why was writeback initiated? */
>
> struct list_head list; /* pending work list */
> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> bool range_cyclic, enum wb_reason reason)
> {
> struct wb_writeback_work *work;
> + bool zero_pages = false;
>
> if (!wb_has_dirty_io(wb))
> return;
>
> /*
> - * If someone asked for zero pages, we write out the WORLD
> + * If someone asked for zero pages, we write out the WORLD.
> + * Places like vmscan and laptop mode want to queue a wakeup to
> + * the flusher threads to clean out everything. To avoid potentially
> + * having tons of these pending, ensure that we only allow one of
> + * them pending and inflight at the time
> */
> - if (!nr_pages)
> + if (!nr_pages) {
> + if (test_bit(WB_zero_pages, &wb->state))
> + return;
> + set_bit(WB_zero_pages, &wb->state);

Shouldn't this be test_and_set? not the worst outcome if you have more
than one pending work item, but still.

> nr_pages = get_nr_dirty_pages();
> + zero_pages = true;
> + }
>
> /*
> * This is WB_SYNC_NONE writeback, so if allocation fails just
> @@ -975,6 +986,7 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> work->range_cyclic = range_cyclic;
> work->reason = reason;
> work->auto_free = 1;
> + work->zero_pages = zero_pages;
>
> wb_queue_work(wb, work);
> }
> @@ -1828,6 +1840,14 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
> list_del_init(&work->list);
> }
> spin_unlock_bh(&wb->work_lock);
> +
> + /*
> + * Once we start processing a work item that had !nr_pages,
> + * clear the wb state bit for that so we can allow more.
> + */
> + if (work && work->zero_pages && test_bit(WB_zero_pages, &wb->state))
> + clear_bit(WB_zero_pages, &wb->state);

nit: should not need to test_bit

> +
> return work;
> }
>
> @@ -1896,6 +1916,12 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> trace_writeback_exec(wb, work);
> wrote += wb_writeback(wb, work);
> finish_writeback_work(wb, work);
> +
> + /*
> + * If we have a lot of pending work, make sure we take
> + * an occasional breather, if needed.
> + */
> + cond_resched();

Probably ought to be in a separate patch.

> }
>
> /*
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 866c433e7d32..7494f6a75458 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -24,6 +24,7 @@ enum wb_state {
> WB_shutting_down, /* wb_shutdown() in progress */
> WB_writeback_running, /* Writeback is in progress */
> WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
> + WB_zero_pages, /* nr_pages == 0 flush pending */

same suggestion: WB_everything

Cheers,
Amir.

2017-09-20 04:13:31

by Jens Axboe

Subject: Re: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

On 09/19/2017 09:10 PM, Amir Goldstein wrote:
> On Tue, Sep 19, 2017 at 10:53 PM, Jens Axboe <[email protected]> wrote:
>> A few callers pass in nr_pages == 0 when they wakeup the flusher
>> threads, which means that the flusher should just flush everything
>> that was currently dirty. If we are tight on memory, we can get
>> tons of these queued from kswapd/vmscan. This causes (at least)
>> two problems:
>>
>> 1) We consume a ton of memory just allocating writeback work items.
>> 2) We spend so much time processing these work items, that we
>> introduce a softlockup in writeback processing.
>>
>> Fix this by adding a 'zero_pages' bit to the writeback structure,
>> and set that when someone queues a nr_pages==0 flusher thread
>> wakeup. The bit is cleared when we start writeback on that work
>> item. If the bit is already set when we attempt to queue !nr_pages
>> writeback, then we simply ignore it.
>>
>> This provides us one of full flush in flight, with one pending as
>> well, and makes for more efficient handling of this type of
>> writeback.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>> ---
>> fs/fs-writeback.c | 30 ++++++++++++++++++++++++++++--
>> include/linux/backing-dev-defs.h | 1 +
>> 2 files changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index a9a86644cb9f..e0240110b36f 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -53,6 +53,7 @@ struct wb_writeback_work {
>> unsigned int for_background:1;
>> unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
>> unsigned int auto_free:1; /* free on completion */
>> + unsigned int zero_pages:1; /* nr_pages == 0 writeback */
>
> Suggest: use a name that describes the intention (e.g. WB_everything)

Agree, the name isn't the best. WB_everything isn't great either, though,
since this isn't an integrity write. WB_start_all would be better,
I'll make that change.

>> enum wb_reason reason; /* why was writeback initiated? */
>>
>> struct list_head list; /* pending work list */
>> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>> bool range_cyclic, enum wb_reason reason)
>> {
>> struct wb_writeback_work *work;
>> + bool zero_pages = false;
>>
>> if (!wb_has_dirty_io(wb))
>> return;
>>
>> /*
>> - * If someone asked for zero pages, we write out the WORLD
>> + * If someone asked for zero pages, we write out the WORLD.
>> + * Places like vmscan and laptop mode want to queue a wakeup to
>> + * the flusher threads to clean out everything. To avoid potentially
>> + * having tons of these pending, ensure that we only allow one of
>> + * them pending and inflight at the time
>> */
>> - if (!nr_pages)
>> + if (!nr_pages) {
>> + if (test_bit(WB_zero_pages, &wb->state))
>> + return;
>> + set_bit(WB_zero_pages, &wb->state);
>
> Shouldn't this be test_and_set? not the worst outcome if you have more
> than one pending work item, but still.

If the frequency of these is high, and they were to trigger the bad
conditions we saw, then a split test + set is faster as it won't
keep re-dirtying the same cacheline from multiple locations. It's
better to leave it a little racy, but faster.

>> @@ -1828,6 +1840,14 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
>> list_del_init(&work->list);
>> }
>> spin_unlock_bh(&wb->work_lock);
>> +
>> + /*
>> + * Once we start processing a work item that had !nr_pages,
>> + * clear the wb state bit for that so we can allow more.
>> + */
>> + if (work && work->zero_pages && test_bit(WB_zero_pages, &wb->state))
>> + clear_bit(WB_zero_pages, &wb->state);
>
> nit: should not need to test_bit

True, we can drop it for this case, as it'll be the common condition
anyway. I'll make that change.

>> @@ -1896,6 +1916,12 @@ static long wb_do_writeback(struct bdi_writeback *wb)
>> trace_writeback_exec(wb, work);
>> wrote += wb_writeback(wb, work);
>> finish_writeback_work(wb, work);
>> +
>> + /*
>> + * If we have a lot of pending work, make sure we take
>> + * an occasional breather, if needed.
>> + */
>> + cond_resched();
>
> Probably ought to be in a separate patch.

Yeah, it probably should be. It's not strictly needed with the other
change anyway; I will just drop it.

New version:

http://git.kernel.dk/cgit/linux-block/commit/?h=writeback-fixup&id=338a69c217cdaaffda93f3cc9a364a347f782adb

--
Jens Axboe

2017-09-20 06:05:24

by Amir Goldstein

Subject: Re: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

On Wed, Sep 20, 2017 at 7:13 AM, Jens Axboe <[email protected]> wrote:
> On 09/19/2017 09:10 PM, Amir Goldstein wrote:
>> On Tue, Sep 19, 2017 at 10:53 PM, Jens Axboe <[email protected]> wrote:
>>> A few callers pass in nr_pages == 0 when they wakeup the flusher
>>> threads, which means that the flusher should just flush everything
>>> that is currently dirty. If we are tight on memory, we can get
>>> tons of these queued from kswapd/vmscan. This causes (at least)
>>> two problems:
>>>
>>> 1) We consume a ton of memory just allocating writeback work items.
>>> 2) We spend so much time processing these work items, that we
>>> introduce a softlockup in writeback processing.
>>>
>>> Fix this by adding a 'zero_pages' bit to the writeback structure,
>>> and set that when someone queues a nr_pages==0 flusher thread
>>> wakeup. The bit is cleared when we start writeback on that work
>>> item. If the bit is already set when we attempt to queue !nr_pages
>>> writeback, then we simply ignore it.
>>>
>>> This provides us one full flush in flight, with one pending as
>>> well, and makes for more efficient handling of this type of
>>> writeback.
>>>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>> ---
>>> fs/fs-writeback.c | 30 ++++++++++++++++++++++++++++--
>>> include/linux/backing-dev-defs.h | 1 +
>>> 2 files changed, 29 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>> index a9a86644cb9f..e0240110b36f 100644
>>> --- a/fs/fs-writeback.c
>>> +++ b/fs/fs-writeback.c
>>> @@ -53,6 +53,7 @@ struct wb_writeback_work {
>>> unsigned int for_background:1;
>>> unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
>>> unsigned int auto_free:1; /* free on completion */
>>> + unsigned int zero_pages:1; /* nr_pages == 0 writeback */
>>
>> Suggest: use a name that describes the intention (e.g. WB_everything)
>
> Agree, the name isn't the best. WB_everything isn't great either, though,
> since this isn't an integrity write. WB_start_all would be better,
> I'll make that change.
>
>>> enum wb_reason reason; /* why was writeback initiated? */
>>>
>>> struct list_head list; /* pending work list */
>>> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>>> bool range_cyclic, enum wb_reason reason)
>>> {
>>> struct wb_writeback_work *work;
>>> + bool zero_pages = false;
>>>
>>> if (!wb_has_dirty_io(wb))
>>> return;
>>>
>>> /*
>>> - * If someone asked for zero pages, we write out the WORLD
>>> + * If someone asked for zero pages, we write out the WORLD.
>>> + * Places like vmscan and laptop mode want to queue a wakeup to
>>> + * the flusher threads to clean out everything. To avoid potentially
>>> + * having tons of these pending, ensure that we only allow one of
>>> + * them pending and inflight at a time.
>>> */
>>> - if (!nr_pages)
>>> + if (!nr_pages) {
>>> + if (test_bit(WB_zero_pages, &wb->state))
>>> + return;
>>> + set_bit(WB_zero_pages, &wb->state);
>>
>> Shouldn't this be test_and_set? not the worst outcome if you have more
>> than one pending work item, but still.
>
> If the frequency of these is high, and they were to trigger the bad
> conditions we saw, then a split test + set is faster as it won't
> keep re-dirtying the same cacheline from multiple locations. It's
> better to leave it a little racy, but faster.
>

Fair enough, but then better change the language of the commit message and
the comment above so they don't claim that there can be only one pending
work item.

Amir.

2017-09-20 12:35:35

by Jens Axboe

Subject: Re: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

On 09/20/2017 12:05 AM, Amir Goldstein wrote:
> On Wed, Sep 20, 2017 at 7:13 AM, Jens Axboe <[email protected]> wrote:
>> On 09/19/2017 09:10 PM, Amir Goldstein wrote:
>>> On Tue, Sep 19, 2017 at 10:53 PM, Jens Axboe <[email protected]> wrote:
>>>> A few callers pass in nr_pages == 0 when they wakeup the flusher
>>>> threads, which means that the flusher should just flush everything
>>>> that is currently dirty. If we are tight on memory, we can get
>>>> tons of these queued from kswapd/vmscan. This causes (at least)
>>>> two problems:
>>>>
>>>> 1) We consume a ton of memory just allocating writeback work items.
>>>> 2) We spend so much time processing these work items, that we
>>>> introduce a softlockup in writeback processing.
>>>>
>>>> Fix this by adding a 'zero_pages' bit to the writeback structure,
>>>> and set that when someone queues a nr_pages==0 flusher thread
>>>> wakeup. The bit is cleared when we start writeback on that work
>>>> item. If the bit is already set when we attempt to queue !nr_pages
>>>> writeback, then we simply ignore it.
>>>>
>>>> This provides us one full flush in flight, with one pending as
>>>> well, and makes for more efficient handling of this type of
>>>> writeback.
>>>>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>> ---
>>>> fs/fs-writeback.c | 30 ++++++++++++++++++++++++++++--
>>>> include/linux/backing-dev-defs.h | 1 +
>>>> 2 files changed, 29 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>> index a9a86644cb9f..e0240110b36f 100644
>>>> --- a/fs/fs-writeback.c
>>>> +++ b/fs/fs-writeback.c
>>>> @@ -53,6 +53,7 @@ struct wb_writeback_work {
>>>> unsigned int for_background:1;
>>>> unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
>>>> unsigned int auto_free:1; /* free on completion */
>>>> + unsigned int zero_pages:1; /* nr_pages == 0 writeback */
>>>
>>> Suggest: use a name that describes the intention (e.g. WB_everything)
>>
>> Agree, the name isn't the best. WB_everything isn't great either, though,
>> since this isn't an integrity write. WB_start_all would be better,
>> I'll make that change.
>>
>>>> enum wb_reason reason; /* why was writeback initiated? */
>>>>
>>>> struct list_head list; /* pending work list */
>>>> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>>>> bool range_cyclic, enum wb_reason reason)
>>>> {
>>>> struct wb_writeback_work *work;
>>>> + bool zero_pages = false;
>>>>
>>>> if (!wb_has_dirty_io(wb))
>>>> return;
>>>>
>>>> /*
>>>> - * If someone asked for zero pages, we write out the WORLD
>>>> + * If someone asked for zero pages, we write out the WORLD.
>>>> + * Places like vmscan and laptop mode want to queue a wakeup to
>>>> + * the flusher threads to clean out everything. To avoid potentially
>>>> + * having tons of these pending, ensure that we only allow one of
>>>> + * them pending and inflight at a time.
>>>> */
>>>> - if (!nr_pages)
>>>> + if (!nr_pages) {
>>>> + if (test_bit(WB_zero_pages, &wb->state))
>>>> + return;
>>>> + set_bit(WB_zero_pages, &wb->state);
>>>
>>> Shouldn't this be test_and_set? not the worst outcome if you have more
>>> than one pending work item, but still.
>>
>> If the frequency of these is high, and they were to trigger the bad
>> conditions we saw, then a split test + set is faster as it won't
>> keep re-dirtying the same cacheline from multiple locations. It's
>> better to leave it a little racy, but faster.
>>
>
> Fair enough, but then better change the language of the commit message and
> the comment above so they don't claim that there can be only one pending
> work item.

That's unchanged, the commit message should be fine. We clear the
bit when we start the work item, so we can have one in flight and
one pending.

But it does reference 'zero_pages', I'll update that.

--
Jens Axboe

2017-09-20 14:17:30

by Jan Kara

Subject: Re: [PATCH 1/6] buffer: cleanup free_more_memory() flusher wakeup

On Tue 19-09-17 13:53:02, Jens Axboe wrote:
> This whole function is... interesting. Change the wakeup call
> to the flusher threads to pass in nr_pages == 0, instead of
> some random number of pages. This matches more closely what
> similar cases do for memory shortage/reclaim.
>
> Signed-off-by: Jens Axboe <[email protected]>

Ok, probably makes sense. You can add:

Reviewed-by: Jan Kara <[email protected]>

BTW, after this nobody seems to use the number of pages for
wakeup_flusher_threads(), so can you just delete the argument from the
function? After all, a system-wide wakeup is useful only for system-wide
sync(2) or memory reclaim, so the number of pages isn't very useful...

Honza

> ---
> fs/buffer.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 170df856bdb9..9471a445e370 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -260,7 +260,7 @@ static void free_more_memory(void)
> struct zoneref *z;
> int nid;
>
> - wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
> + wakeup_flusher_threads(0, WB_REASON_FREE_MORE_MEM);
> yield();
>
> for_each_online_node(nid) {
> --
> 2.7.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-20 14:20:00

by Jan Kara

Subject: Re: [PATCH 2/6] fs-writeback: provide a wakeup_flusher_threads_bdi()

On Tue 19-09-17 13:53:03, Jens Axboe wrote:
> Similar to wakeup_flusher_threads(), except that we only wake
> up the flusher threads on the specified backing device.
>
> No functional changes in this patch.
>
> Signed-off-by: Jens Axboe <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/fs-writeback.c | 40 ++++++++++++++++++++++++++++++----------
> include/linux/writeback.h | 2 ++
> 2 files changed, 32 insertions(+), 10 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 245c430a2e41..03fda0830bf8 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1947,6 +1947,34 @@ void wb_workfn(struct work_struct *work)
> }
>
> /*
> + * Start writeback of `nr_pages' pages on this bdi. If `nr_pages' is zero,
> + * write back the whole world.
> + */
> +static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
> + long nr_pages, enum wb_reason reason)
> +{
> + struct bdi_writeback *wb;
> +
> + if (!bdi_has_dirty_io(bdi))
> + return;
> +
> + list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
> + wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
> + false, reason);
> +}
> +
> +void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, long nr_pages,
> + enum wb_reason reason)
> +{
> + if (!nr_pages)
> + nr_pages = get_nr_dirty_pages();
> +
> + rcu_read_lock();
> + __wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
> + rcu_read_unlock();
> +}
> +
> +/*
> * Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
> * the whole world.
> */
> @@ -1964,16 +1992,8 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
> nr_pages = get_nr_dirty_pages();
>
> rcu_read_lock();
> - list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
> - struct bdi_writeback *wb;
> -
> - if (!bdi_has_dirty_io(bdi))
> - continue;
> -
> - list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
> - wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
> - false, reason);
> - }
> + list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
> + __wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
> rcu_read_unlock();
> }
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index d5815794416c..5a7ed74d1f6f 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -190,6 +190,8 @@ bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
> enum wb_reason reason);
> void sync_inodes_sb(struct super_block *);
> void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
> +void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi, long nr_pages,
> + enum wb_reason reason);
> void inode_wait_for_writeback(struct inode *inode);
>
> /* writeback.h requires fs.h; it, too, is not included from here. */
> --
> 2.7.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-20 14:35:08

by Jan Kara

Subject: Re: [PATCH 3/6] page-writeback: pass in '0' for nr_pages writeback in laptop mode

On Tue 19-09-17 13:53:04, Jens Axboe wrote:
> Laptop mode really wants to writeback the number of dirty
> pages and inodes. Instead of calculating this in the caller,
> just pass in 0 and let wakeup_flusher_threads() handle it.
>
> Use the new wakeup_flusher_threads_bdi() instead of rolling
> our own.
>
> Signed-off-by: Jens Axboe <[email protected]>
...
> - rcu_read_lock();
> - list_for_each_entry_rcu(wb, &q->backing_dev_info->wb_list, bdi_node)
> - if (wb_has_dirty_io(wb))
> - wb_start_writeback(wb, nr_pages, true,
> - WB_REASON_LAPTOP_TIMER);
> - rcu_read_unlock();
> + wakeup_flusher_threads_bdi(q->backing_dev_info, 0,
> + WB_REASON_LAPTOP_TIMER);
> }

So this slightly changes the semantics since previously we were doing
range_cyclic writeback and now we don't. I don't think this matters in
practice but please mention that in the changelog. With that you can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-20 14:35:42

by Jan Kara

Subject: Re: [PATCH 4/6] fs-writeback: make wb_start_writeback() static

On Tue 19-09-17 13:53:05, Jens Axboe wrote:
> We don't have any callers outside of fs-writeback.c anymore,
> make it private.
>
> Signed-off-by: Jens Axboe <[email protected]>

You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/fs-writeback.c | 4 ++--
> include/linux/backing-dev.h | 2 --
> 2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 03fda0830bf8..7564347914f8 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -933,8 +933,8 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>
> #endif /* CONFIG_CGROUP_WRITEBACK */
>
> -void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> - bool range_cyclic, enum wb_reason reason)
> +static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> + bool range_cyclic, enum wb_reason reason)
> {
> struct wb_writeback_work *work;
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 854e1bdd0b2a..157e950a70dc 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -38,8 +38,6 @@ static inline struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
> return bdi_alloc_node(gfp_mask, NUMA_NO_NODE);
> }
>
> -void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> - bool range_cyclic, enum wb_reason reason);
> void wb_start_background_writeback(struct bdi_writeback *wb);
> void wb_workfn(struct work_struct *work);
> void wb_wakeup_delayed(struct bdi_writeback *wb);
> --
> 2.7.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-20 14:42:04

by Jan Kara

Subject: Re: [PATCH 5/6] fs-writeback: move nr_pages == 0 logic to one location

On Tue 19-09-17 13:53:06, Jens Axboe wrote:
> Now that we have no external callers of wb_start_writeback(),
> we can move the nr_pages == 0 logic into that function.
>
> Signed-off-by: Jens Axboe <[email protected]>

...

> +static unsigned long get_nr_dirty_pages(void)
> +{
> + return global_node_page_state(NR_FILE_DIRTY) +
> + global_node_page_state(NR_UNSTABLE_NFS) +
> + get_nr_dirty_inodes();
> +}
> +
> static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> bool range_cyclic, enum wb_reason reason)
> {
> @@ -942,6 +953,12 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> return;
>
> /*
> + * If someone asked for zero pages, we write out the WORLD
> + */
> + if (!nr_pages)
> + nr_pages = get_nr_dirty_pages();
> +

So for 'wb' we have a better estimate of the amount we should write - use
wb_stat_sum(wb, WB_RECLAIMABLE) statistics - that is essentially dirty +
unstable_nfs broken down to bdi_writeback.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-20 14:44:28

by Jan Kara

Subject: Re: [PATCH 6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

On Tue 19-09-17 22:13:25, Jens Axboe wrote:
> On 09/19/2017 09:10 PM, Amir Goldstein wrote:
> New version:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=writeback-fixup&id=338a69c217cdaaffda93f3cc9a364a347f782adb

This looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-20 15:05:58

by Jens Axboe

Subject: Re: [PATCH 5/6] fs-writeback: move nr_pages == 0 logic to one location

On 09/20/2017 08:41 AM, Jan Kara wrote:
> On Tue 19-09-17 13:53:06, Jens Axboe wrote:
>> Now that we have no external callers of wb_start_writeback(),
>> we can move the nr_pages == 0 logic into that function.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>
> ...
>
>> +static unsigned long get_nr_dirty_pages(void)
>> +{
>> + return global_node_page_state(NR_FILE_DIRTY) +
>> + global_node_page_state(NR_UNSTABLE_NFS) +
>> + get_nr_dirty_inodes();
>> +}
>> +
>> static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>> bool range_cyclic, enum wb_reason reason)
>> {
>> @@ -942,6 +953,12 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>> return;
>>
>> /*
>> + * If someone asked for zero pages, we write out the WORLD
>> + */
>> + if (!nr_pages)
>> + nr_pages = get_nr_dirty_pages();
>> +
>
> So for 'wb' we have a better estimate of the amount we should write - use
> wb_stat_sum(wb, WB_RECLAIMABLE) statistics - that is essentially dirty +
> unstable_nfs broken down to bdi_writeback.

I don't mind making that change, but I think that should be a separate
patch. We're using get_nr_dirty_pages() in existing locations where
we have the 'wb', like in wb_check_old_data_flush().

--
Jens Axboe

2017-09-20 15:18:55

by Jens Axboe

Subject: Re: [PATCH 1/6] buffer: cleanup free_more_memory() flusher wakeup

On 09/20/2017 08:17 AM, Jan Kara wrote:
> On Tue 19-09-17 13:53:02, Jens Axboe wrote:
>> This whole function is... interesting. Change the wakeup call
>> to the flusher threads to pass in nr_pages == 0, instead of
>> some random number of pages. This matches more closely what
>> similar cases do for memory shortage/reclaim.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>
> Ok, probably makes sense. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> BTW, after this nobody seems to use the number of pages for
> wakeup_flusher_threads(), so can you just delete the argument from the
> function? After all, a system-wide wakeup is useful only for system-wide
> sync(2) or memory reclaim, so the number of pages isn't very useful...

Great observation! That's true, and if we kill that, it enables
further cleanups down the line for patch 5 and 6. I have
incorporated that.

--
Jens Axboe

2017-09-20 15:19:11

by Jens Axboe

Subject: Re: [PATCH 3/6] page-writeback: pass in '0' for nr_pages writeback in laptop mode

On 09/20/2017 08:35 AM, Jan Kara wrote:
> On Tue 19-09-17 13:53:04, Jens Axboe wrote:
>> Laptop mode really wants to writeback the number of dirty
>> pages and inodes. Instead of calculating this in the caller,
>> just pass in 0 and let wakeup_flusher_threads() handle it.
>>
>> Use the new wakeup_flusher_threads_bdi() instead of rolling
>> our own.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
> ...
>> - rcu_read_lock();
>> - list_for_each_entry_rcu(wb, &q->backing_dev_info->wb_list, bdi_node)
>> - if (wb_has_dirty_io(wb))
>> - wb_start_writeback(wb, nr_pages, true,
>> - WB_REASON_LAPTOP_TIMER);
>> - rcu_read_unlock();
>> + wakeup_flusher_threads_bdi(q->backing_dev_info, 0,
>> + WB_REASON_LAPTOP_TIMER);
>> }
>
> So this slightly changes the semantics since previously we were doing
> range_cyclic writeback and now we don't. I don't think this matters in
> practice but please mention that in the changelog. With that you can add:

Thanks, I added a note about that in the commit message.

--
Jens Axboe

2017-09-20 15:36:31

by Jan Kara

Subject: Re: [PATCH 5/6] fs-writeback: move nr_pages == 0 logic to one location

On Wed 20-09-17 09:05:51, Jens Axboe wrote:
> On 09/20/2017 08:41 AM, Jan Kara wrote:
> > On Tue 19-09-17 13:53:06, Jens Axboe wrote:
> >> Now that we have no external callers of wb_start_writeback(),
> >> we can move the nr_pages == 0 logic into that function.
> >>
> >> Signed-off-by: Jens Axboe <[email protected]>
> >
> > ...
> >
> >> +static unsigned long get_nr_dirty_pages(void)
> >> +{
> >> + return global_node_page_state(NR_FILE_DIRTY) +
> >> + global_node_page_state(NR_UNSTABLE_NFS) +
> >> + get_nr_dirty_inodes();
> >> +}
> >> +
> >> static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> >> bool range_cyclic, enum wb_reason reason)
> >> {
> >> @@ -942,6 +953,12 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> >> return;
> >>
> >> /*
> >> + * If someone asked for zero pages, we write out the WORLD
> >> + */
> >> + if (!nr_pages)
> >> + nr_pages = get_nr_dirty_pages();
> >> +
> >
> > So for 'wb' we have a better estimate of the amount we should write - use
> > wb_stat_sum(wb, WB_RECLAIMABLE) statistics - that is essentially dirty +
> > unstable_nfs broken down to bdi_writeback.
>
> I don't mind making that change, but I think that should be a separate
> patch. We're using get_nr_dirty_pages() in existing locations where
> we have the 'wb', like in wb_check_old_data_flush().

Good point and fully agreed. So you can add:

Reviewed-by: Jan Kara <[email protected]>

for this patch.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-09-20 15:40:23

by Jens Axboe

Subject: Re: [PATCH 5/6] fs-writeback: move nr_pages == 0 logic to one location

On 09/20/2017 09:36 AM, Jan Kara wrote:
> On Wed 20-09-17 09:05:51, Jens Axboe wrote:
>> On 09/20/2017 08:41 AM, Jan Kara wrote:
>>> On Tue 19-09-17 13:53:06, Jens Axboe wrote:
>>>> Now that we have no external callers of wb_start_writeback(),
>>>> we can move the nr_pages == 0 logic into that function.
>>>>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>
>>> ...
>>>
>>>> +static unsigned long get_nr_dirty_pages(void)
>>>> +{
>>>> + return global_node_page_state(NR_FILE_DIRTY) +
>>>> + global_node_page_state(NR_UNSTABLE_NFS) +
>>>> + get_nr_dirty_inodes();
>>>> +}
>>>> +
>>>> static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>>>> bool range_cyclic, enum wb_reason reason)
>>>> {
>>>> @@ -942,6 +953,12 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>>>> return;
>>>>
>>>> /*
>>>> + * If someone asked for zero pages, we write out the WORLD
>>>> + */
>>>> + if (!nr_pages)
>>>> + nr_pages = get_nr_dirty_pages();
>>>> +
>>>
>>> So for 'wb' we have a better estimate of the amount we should write - use
>>> wb_stat_sum(wb, WB_RECLAIMABLE) statistics - that is essentially dirty +
>>> unstable_nfs broken down to bdi_writeback.
>>
>> I don't mind making that change, but I think that should be a separate
>> patch. We're using get_nr_dirty_pages() in existing locations where
>> we have the 'wb', like in wb_check_old_data_flush().
>
> Good point and fully agreed. So you can add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thanks Jan, added. I just sent out the new version, mainly because the
removal of 'nr_pages' changes the later patches a bit. All for the
better, in my opinion.

--
Jens Axboe

2017-09-20 19:32:30

by Jens Axboe

Subject: Re: [PATCH 0/6] More graceful flusher thread memory reclaim wakeup

On 09/20/2017 01:29 PM, John Stoffel wrote:
> On Tue, Sep 19, 2017 at 01:53:01PM -0600, Jens Axboe wrote:
>> We've had some issues with writeback in presence of memory reclaim
>> at Facebook, and this patch set attempts to fix it up. The real
>> functional change is the last patch in the series, the first 5 are
>> prep and cleanup patches.
>>
>> The basic idea is that we have callers that call
>> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
>> everything'. For memory reclaim situations, we can end up queuing
>> a TON of these kinds of writeback units. This can cause softlockups
>> and further memory issues, since we allocate huge amounts of
>> struct wb_writeback_work to handle this writeback. Handle this
>> situation more gracefully.
>
> This looks nice, but do you have any numbers to show how this improves
> things? I read the patches, but I'm not strong enough to comment on
> them at all. But I am interested in how this improves writeback under
> pressure, if at all.

Writeback should be about the same, it's mostly about preventing
softlockups and excessive memory usage, under conditions where we are
actively trying to reclaim/clean memory. It was bad enough to cause
softlockups for writeback work processing, while the pending writeback
work units grew to insane lengths.

--
Jens Axboe

2017-09-20 19:39:16

by John Stoffel

Subject: Re: [PATCH 0/6] More graceful flusher thread memory reclaim wakeup

On Tue, Sep 19, 2017 at 01:53:01PM -0600, Jens Axboe wrote:
> We've had some issues with writeback in presence of memory reclaim
> at Facebook, and this patch set attempts to fix it up. The real
> functional change is the last patch in the series, the first 5 are
> prep and cleanup patches.
>
> The basic idea is that we have callers that call
> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
> everything'. For memory reclaim situations, we can end up queuing
> a TON of these kinds of writeback units. This can cause softlockups
> and further memory issues, since we allocate huge amounts of
> struct wb_writeback_work to handle this writeback. Handle this
> situation more gracefully.

This looks nice, but do you have any numbers to show how this improves
things? I read the patches, but I'm not strong enough to comment on
them at all. But I am interested in how this improves writeback under
pressure, if at all.

John

2017-09-20 23:11:54

by Johannes Weiner

Subject: Re: [PATCH 0/6] More graceful flusher thread memory reclaim wakeup

[ Fixed up CC list. John, you're sending email with
From: John Stoffel <[email protected]> ]

On Wed, Sep 20, 2017 at 01:32:25PM -0600, Jens Axboe wrote:
> On 09/20/2017 01:29 PM, John Stoffel wrote:
> > On Tue, Sep 19, 2017 at 01:53:01PM -0600, Jens Axboe wrote:
> >> We've had some issues with writeback in presence of memory reclaim
> >> at Facebook, and this patch set attempts to fix it up. The real
> >> functional change is the last patch in the series, the first 5 are
> >> prep and cleanup patches.
> >>
> >> The basic idea is that we have callers that call
> >> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
> >> everything'. For memory reclaim situations, we can end up queuing
> >> a TON of these kinds of writeback units. This can cause softlockups
> >> and further memory issues, since we allocate huge amounts of
> >> struct wb_writeback_work to handle this writeback. Handle this
> >> situation more gracefully.
> >
> > This looks nice, but do you have any numbers to show how this improves
> > things? I read the patches, but I'm not strong enough to comment on
> > them at all. But I am interested in how this improves writeback under
> > pressure, if at all.
>
> Writeback should be about the same, it's mostly about preventing
> softlockups and excessive memory usage, under conditions where we are
> actively trying to reclaim/clean memory. It was bad enough to cause
> softlockups for writeback work processing, while the pending writeback
> work units grew to insane lengths.

In numbers, we have seen situations where we had 600 million writeback
work items queued up from reclaim under pressure. That's 35G worth of
work descriptors, and the machine was struggling to remain responsive
due to a lack of memory.

Once writeback against all outstanding dirty pages has been requested,
there really isn't a need to queue even a second work item; the job is
already being performed. We can queue the next one when it completes.