2017-09-27 20:14:17

by Jens Axboe

Subject: [PATCH 0/12 v3] Writeback improvements

We've had some issues with writeback in presence of memory reclaim
at Facebook, and this patch set attempts to fix it up. The real
functional change for that issue is patch 10. The rest are cleanups,
as well as the removal of non-range-cyclic writeback. The users
of that were sync_inodes_sb() and wakeup_flusher_threads(), both of
which write back all of the dirty pages.

The basic idea is that we have callers that call
wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
everything'. For memory reclaim situations, we can end up queuing
a TON of these kinds of writeback units. This can cause softlockups
and further memory issues, since we allocate huge amounts of
struct wb_writeback_work to handle this writeback. Handle this
situation more gracefully.
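
The gist of the functional fix (patch 10) is a per-wb gating bit. Here
is a stripped-down sketch of what that patch ends up doing - simplified,
the real version also fills in a few more work fields (sync_mode,
nr_pages, reason) and, until the final patch, takes a range_cyclic
argument:

    static void wb_start_writeback(struct bdi_writeback *wb, enum wb_reason reason)
    {
        struct wb_writeback_work *work;

        /* Only allow one "flush everything" item pending per wb */
        if (test_bit(WB_start_all, &wb->state))
            return;
        set_bit(WB_start_all, &wb->state);

        work = kzalloc(sizeof(*work),
                       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
        if (!work) {
            /* drop the claim and just kick the flusher thread */
            clear_bit(WB_start_all, &wb->state);
            wb_wakeup(wb);
            return;
        }

        work->start_all = 1;    /* flusher clears WB_start_all on pickup */
        work->auto_free = 1;
        wb_queue_work(wb, work);
    }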


drivers/md/bitmap.c | 2
drivers/staging/lustre/lustre/llite/rw.c | 25 ++-----
fs/afs/write.c | 25 +------
fs/btrfs/extent_io.c | 31 ++-------
fs/buffer.c | 60 +++---------------
fs/ceph/addr.c | 26 ++-----
fs/cifs/file.c | 20 +-----
fs/ext4/inode.c | 26 +++----
fs/f2fs/data.c | 26 ++-----
fs/fs-writeback.c | 103 +++++++++++++++++++------------
fs/gfs2/aops.c | 27 ++------
fs/ntfs/aops.c | 2
fs/ntfs/mft.c | 2
fs/sync.c | 2
include/linux/backing-dev-defs.h | 1
include/linux/backing-dev.h | 2
include/linux/buffer_head.h | 2
include/linux/writeback.h | 5 -
include/trace/events/btrfs.h | 2
include/trace/events/ext4.h | 2
include/trace/events/f2fs.h | 2
include/trace/events/writeback.h | 4 -
mm/page-writeback.c | 44 ++-----------
mm/vmscan.c | 2
24 files changed, 159 insertions(+), 284 deletions(-)


Changes since v2:

- Removal of non-range_cyclic writeback.
- Cleanup of the buffer.c failure handling code, utilize
__GFP_NOFAIL instead of rolling our own.
- Reinstate cyclic writeback for laptop mode; it's now the only
option available.
- Rebased on top of master, and series shuffled around.

Changes since v1:

- Rename WB_zero_pages to WB_start_all (Amir).
- Remove a test_bit() for a condition where we always expect the bit
to be set.
- Remove 'nr_pages' from the wakeup flusher threads helpers, since
everybody now passes in zero. Enables further cleanups in later
patches too (Jan).
- Fix a case where I forgot to clear WB_start_all if 'work' allocation
failed.
- Get rid of cond_resched() in the wb_do_writeback() loop.

--
Jens Axboe



2017-09-27 20:14:22

by Jens Axboe

Subject: [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow()

Since the previous commit removed any case where grow_buffers()
would return failure due to memory allocations, we can safely
remove the case where we have to call free_more_memory() in
this function.

Since this is also the last user of free_more_memory(), kill
it off completely.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/buffer.c | 23 -----------------------
1 file changed, 23 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3b60cd8456db..bff571dc7bc3 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -253,27 +253,6 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
}

/*
- * Kick the writeback threads then try to free up some ZONE_NORMAL memory.
- */
-static void free_more_memory(void)
-{
- struct zoneref *z;
- int nid;
-
- wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
- yield();
-
- for_each_online_node(nid) {
-
- z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
- gfp_zone(GFP_NOFS), NULL);
- if (z->zone)
- try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
- GFP_NOFS, NULL);
- }
-}
-
-/*
* I/O completion handler for block_read_full_page() - pages
* which come unlocked at the end of I/O.
*/
@@ -1086,8 +1065,6 @@ __getblk_slow(struct block_device *bdev, sector_t block,
ret = grow_buffers(bdev, block, size, gfp);
if (ret < 0)
return NULL;
- if (ret == 0)
- free_more_memory();
}
}

--
2.7.4

2017-09-27 20:14:28

by Jens Axboe

Subject: [PATCH 07/12] writeback: pass in '0' for nr_pages writeback in laptop mode

Laptop mode really wants to writeback the number of dirty
pages and inodes. Instead of calculating this in the caller,
just pass in 0 and let wakeup_flusher_threads() handle it.

Use the new wakeup_flusher_threads_bdi() instead of rolling
our own.

Acked-by: Johannes Weiner <[email protected]>
Tested-by: Chris Mason <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
mm/page-writeback.c | 17 +----------------
1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0b9c5cbe8eba..8d1fc593bce8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1980,23 +1980,8 @@ int dirty_writeback_centisecs_handler(struct ctl_table *table, int write,
void laptop_mode_timer_fn(unsigned long data)
{
struct request_queue *q = (struct request_queue *)data;
- int nr_pages = global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS);
- struct bdi_writeback *wb;

- /*
- * We want to write everything out, not just down to the dirty
- * threshold
- */
- if (!bdi_has_dirty_io(q->backing_dev_info))
- return;
-
- rcu_read_lock();
- list_for_each_entry_rcu(wb, &q->backing_dev_info->wb_list, bdi_node)
- if (wb_has_dirty_io(wb))
- wb_start_writeback(wb, nr_pages, true,
- WB_REASON_LAPTOP_TIMER);
- rcu_read_unlock();
+ wakeup_flusher_threads_bdi(q->backing_dev_info, WB_REASON_LAPTOP_TIMER);
}

/*
--
2.7.4

2017-09-27 20:14:38

by Jens Axboe

Subject: [PATCH 12/12] writeback: kill off ->range_cyclic option

All types of writeback are now range cyclic. Kill off the member
from struct wb_writeback_work and struct writeback_control.

Remove the various checks for whether or not this is range cyclic
writeback in the writeback code and file systems.

For tracing, we leave the member there, and just always set it to 1.

Signed-off-by: Jens Axboe <[email protected]>
---
drivers/staging/lustre/lustre/llite/rw.c | 25 +++++++------------------
fs/afs/write.c | 25 ++++++-------------------
fs/btrfs/extent_io.c | 31 +++++++------------------------
fs/ceph/addr.c | 26 ++++++++------------------
fs/cifs/file.c | 20 +++++---------------
fs/ext4/inode.c | 26 ++++++++++----------------
fs/f2fs/data.c | 26 ++++++++------------------
fs/fs-writeback.c | 12 ++----------
fs/gfs2/aops.c | 27 ++++++++-------------------
include/linux/writeback.h | 1 -
include/trace/events/btrfs.h | 2 +-
include/trace/events/ext4.h | 2 +-
include/trace/events/f2fs.h | 2 +-
include/trace/events/writeback.h | 4 ++--
mm/page-writeback.c | 27 ++++++++-------------------
15 files changed, 74 insertions(+), 182 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/rw.c b/drivers/staging/lustre/lustre/llite/rw.c
index e72090572bcc..d5960dd20405 100644
--- a/drivers/staging/lustre/lustre/llite/rw.c
+++ b/drivers/staging/lustre/lustre/llite/rw.c
@@ -1002,18 +1002,8 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
int result;
int ignore_layout = 0;

- if (wbc->range_cyclic) {
- start = mapping->writeback_index << PAGE_SHIFT;
- end = OBD_OBJECT_EOF;
- } else {
- start = wbc->range_start;
- end = wbc->range_end;
- if (end == LLONG_MAX) {
- end = OBD_OBJECT_EOF;
- range_whole = start == 0;
- }
- }
-
+ start = mapping->writeback_index << PAGE_SHIFT;
+ end = OBD_OBJECT_EOF;
mode = CL_FSYNC_NONE;
if (wbc->sync_mode == WB_SYNC_ALL)
mode = CL_FSYNC_LOCAL;
@@ -1034,12 +1024,11 @@ int ll_writepages(struct address_space *mapping, struct writeback_control *wbc)
result = 0;
}

- if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0)) {
- if (end == OBD_OBJECT_EOF)
- mapping->writeback_index = 0;
- else
- mapping->writeback_index = (end >> PAGE_SHIFT) + 1;
- }
+ if (end == OBD_OBJECT_EOF)
+ mapping->writeback_index = 0;
+ else
+ mapping->writeback_index = (end >> PAGE_SHIFT) + 1;
+
return result;
}

diff --git a/fs/afs/write.c b/fs/afs/write.c
index 106e43db1115..2e21c197173f 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -570,24 +570,12 @@ int afs_writepages(struct address_space *mapping,

_enter("");

- if (wbc->range_cyclic) {
- start = mapping->writeback_index;
- end = -1;
- ret = afs_writepages_region(mapping, wbc, start, end, &next);
- if (start > 0 && wbc->nr_to_write > 0 && ret == 0)
- ret = afs_writepages_region(mapping, wbc, 0, start,
- &next);
- mapping->writeback_index = next;
- } else if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) {
- end = (pgoff_t)(LLONG_MAX >> PAGE_SHIFT);
- ret = afs_writepages_region(mapping, wbc, 0, end, &next);
- if (wbc->nr_to_write > 0)
- mapping->writeback_index = next;
- } else {
- start = wbc->range_start >> PAGE_SHIFT;
- end = wbc->range_end >> PAGE_SHIFT;
- ret = afs_writepages_region(mapping, wbc, start, end, &next);
- }
+ start = mapping->writeback_index;
+ end = -1;
+ ret = afs_writepages_region(mapping, wbc, start, end, &next);
+ if (start > 0 && wbc->nr_to_write > 0 && ret == 0)
+ ret = afs_writepages_region(mapping, wbc, 0, start, &next);
+ mapping->writeback_index = next;

_leave(" = %d", ret);
return ret;
@@ -685,7 +673,6 @@ int afs_writeback_all(struct afs_vnode *vnode)
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = LONG_MAX,
- .range_cyclic = 1,
};
int ret;

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3e5bb0cdd3cd..5c3bd7d50435 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3803,14 +3803,8 @@ int btree_write_cache_pages(struct address_space *mapping,
int tag;

pagevec_init(&pvec, 0);
- if (wbc->range_cyclic) {
- index = mapping->writeback_index; /* Start from prev offset */
- end = -1;
- } else {
- index = wbc->range_start >> PAGE_SHIFT;
- end = wbc->range_end >> PAGE_SHIFT;
- scanned = 1;
- }
+ index = mapping->writeback_index; /* Start from prev offset */
+ end = -1;
if (wbc->sync_mode == WB_SYNC_ALL)
tag = PAGECACHE_TAG_TOWRITE;
else
@@ -3830,7 +3824,7 @@ int btree_write_cache_pages(struct address_space *mapping,
if (!PagePrivate(page))
continue;

- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
done = 1;
break;
}
@@ -3930,7 +3924,6 @@ static int extent_write_cache_pages(struct address_space *mapping,
pgoff_t index;
pgoff_t end; /* Inclusive */
pgoff_t done_index;
- int range_whole = 0;
int scanned = 0;
int tag;

@@ -3947,16 +3940,8 @@ static int extent_write_cache_pages(struct address_space *mapping,
return 0;

pagevec_init(&pvec, 0);
- if (wbc->range_cyclic) {
- index = mapping->writeback_index; /* Start from prev offset */
- end = -1;
- } else {
- index = wbc->range_start >> PAGE_SHIFT;
- end = wbc->range_end >> PAGE_SHIFT;
- if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
- range_whole = 1;
- scanned = 1;
- }
+ index = mapping->writeback_index; /* Start from prev offset */
+ end = -1;
if (wbc->sync_mode == WB_SYNC_ALL)
tag = PAGECACHE_TAG_TOWRITE;
else
@@ -3992,7 +3977,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
continue;
}

- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
done = 1;
unlock_page(page);
continue;
@@ -4051,9 +4036,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
goto retry;
}

- if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole))
- mapping->writeback_index = done_index;
-
+ mapping->writeback_index = done_index;
btrfs_add_delayed_iput(inode);
return ret;
}
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b3e3edc09d80..5aa5a901c448 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -791,7 +791,7 @@ static int ceph_writepages_start(struct address_space *mapping,
unsigned int wsize = i_blocksize(inode);
struct ceph_osd_request *req = NULL;
struct ceph_writeback_ctl ceph_wbc;
- bool should_loop, range_whole = false;
+ bool should_loop;
bool stop, done = false;

dout("writepages_start %p (mode=%s)\n", inode,
@@ -812,7 +812,7 @@ static int ceph_writepages_start(struct address_space *mapping,

pagevec_init(&pvec, 0);

- start_index = wbc->range_cyclic ? mapping->writeback_index : 0;
+ start_index = mapping->writeback_index;
index = start_index;

retry:
@@ -830,19 +830,11 @@ static int ceph_writepages_start(struct address_space *mapping,
should_loop = false;
if (ceph_wbc.head_snapc && snapc != last_snapc) {
/* where to start/end? */
- if (wbc->range_cyclic) {
- index = start_index;
- end = -1;
- if (index > 0)
- should_loop = true;
- dout(" cyclic, start at %lu\n", index);
- } else {
- index = wbc->range_start >> PAGE_SHIFT;
- end = wbc->range_end >> PAGE_SHIFT;
- if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
- range_whole = true;
- dout(" not cyclic, %lu to %lu\n", index, end);
- }
+ index = start_index;
+ end = -1;
+ if (index > 0)
+ should_loop = true;
+ dout(" cyclic, start at %lu\n", index);
} else if (!ceph_wbc.head_snapc) {
/* Do not respect wbc->range_{start,end}. Dirty pages
* in that range can be associated with newer snapc.
@@ -1194,9 +1186,7 @@ static int ceph_writepages_start(struct address_space *mapping,
goto retry;
}

- if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
- mapping->writeback_index = index;
-
+ mapping->writeback_index = index;
out:
ceph_osdc_put_request(req);
ceph_put_snap_context(last_snapc);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 92fdf9c35de2..98498afc3f87 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2021,7 +2021,7 @@ wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages,
break;
}

- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
*done = true;
unlock_page(page);
break;
@@ -2112,7 +2112,7 @@ static int cifs_writepages(struct address_space *mapping,
{
struct cifs_sb_info *cifs_sb = CIFS_SB(mapping->host->i_sb);
struct TCP_Server_Info *server;
- bool done = false, scanned = false, range_whole = false;
+ bool done = false, scanned = false;
pgoff_t end, index;
struct cifs_writedata *wdata;
int rc = 0;
@@ -2124,16 +2124,8 @@ static int cifs_writepages(struct address_space *mapping,
if (cifs_sb->wsize < PAGE_SIZE)
return generic_writepages(mapping, wbc);

- if (wbc->range_cyclic) {
- index = mapping->writeback_index; /* Start from prev offset */
- end = -1;
- } else {
- index = wbc->range_start >> PAGE_SHIFT;
- end = wbc->range_end >> PAGE_SHIFT;
- if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
- range_whole = true;
- scanned = true;
- }
+ index = mapping->writeback_index; /* Start from prev offset */
+ end = -1;
server = cifs_sb_master_tcon(cifs_sb)->ses->server;
retry:
while (!done && index <= end) {
@@ -2214,9 +2206,7 @@ static int cifs_writepages(struct address_space *mapping,
goto retry;
}

- if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
- mapping->writeback_index = index;
-
+ mapping->writeback_index = index;
return rc;
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 31db875bc7a1..f33f700b110c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2792,16 +2792,11 @@ static int ext4_writepages(struct address_space *mapping,
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;

- if (wbc->range_cyclic) {
- writeback_index = mapping->writeback_index;
- if (writeback_index)
- cycled = 0;
- mpd.first_page = writeback_index;
- mpd.last_page = -1;
- } else {
- mpd.first_page = wbc->range_start >> PAGE_SHIFT;
- mpd.last_page = wbc->range_end >> PAGE_SHIFT;
- }
+ writeback_index = mapping->writeback_index;
+ if (writeback_index)
+ cycled = 0;
+ mpd.first_page = writeback_index;
+ mpd.last_page = -1;

mpd.inode = inode;
mpd.wbc = wbc;
@@ -2940,12 +2935,11 @@ static int ext4_writepages(struct address_space *mapping,
}

/* Update index */
- if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
- /*
- * Set the writeback_index so that range_cyclic
- * mode will write it back later
- */
- mapping->writeback_index = mpd.first_page;
+ /*
+ * Set the writeback_index so that range_cyclic
+ * mode will write it back later
+ */
+ mapping->writeback_index = mpd.first_page;

out_writepages:
trace_ext4_writepages_result(inode, wbc, ret,
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 36b535207c88..05edac5fdf2e 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1632,7 +1632,6 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
pgoff_t done_index;
pgoff_t last_idx = ULONG_MAX;
int cycled;
- int range_whole = 0;
int tag;

pagevec_init(&pvec, 0);
@@ -1643,21 +1642,13 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
else
clear_inode_flag(mapping->host, FI_HOT_DATA);

- if (wbc->range_cyclic) {
- writeback_index = mapping->writeback_index; /* prev offset */
- index = writeback_index;
- if (index == 0)
- cycled = 1;
- else
- cycled = 0;
- end = -1;
- } else {
- index = wbc->range_start >> PAGE_SHIFT;
- end = wbc->range_end >> PAGE_SHIFT;
- if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
- range_whole = 1;
- cycled = 1; /* ignore range_cyclic tests */
- }
+ writeback_index = mapping->writeback_index; /* prev offset */
+ index = writeback_index;
+ if (index == 0)
+ cycled = 1;
+ else
+ cycled = 0;
+ end = -1;
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag = PAGECACHE_TAG_TOWRITE;
else
@@ -1755,8 +1746,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
end = writeback_index - 1;
goto retry;
}
- if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
- mapping->writeback_index = done_index;
+ mapping->writeback_index = done_index;

if (last_idx != ULONG_MAX)
f2fs_submit_merged_write_cond(F2FS_M_SB(mapping), mapping->host,
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 686d510e618c..65bbc627addf 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -49,7 +49,6 @@ struct wb_writeback_work {
enum writeback_sync_modes sync_mode;
unsigned int tagged_writepages:1;
unsigned int for_kupdate:1;
- unsigned int range_cyclic:1;
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned int auto_free:1; /* free on completion */
@@ -945,8 +944,7 @@ static unsigned long get_nr_dirty_pages(void)
get_nr_dirty_inodes();
}

-static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
- enum wb_reason reason)
+static void wb_start_writeback(struct bdi_writeback *wb, enum wb_reason reason)
{
struct wb_writeback_work *work;

@@ -982,7 +980,6 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,

work->sync_mode = WB_SYNC_NONE;
work->nr_pages = wb_split_bdi_pages(wb, get_nr_dirty_pages());
- work->range_cyclic = range_cyclic;
work->reason = reason;
work->auto_free = 1;
work->start_all = 1;
@@ -1526,7 +1523,6 @@ static long writeback_sb_inodes(struct super_block *sb,
.for_kupdate = work->for_kupdate,
.for_background = work->for_background,
.for_sync = work->for_sync,
- .range_cyclic = work->range_cyclic,
.range_start = 0,
.range_end = LLONG_MAX,
};
@@ -1698,7 +1694,6 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
struct wb_writeback_work work = {
.nr_pages = nr_pages,
.sync_mode = WB_SYNC_NONE,
- .range_cyclic = 1,
.reason = reason,
};
struct blk_plug plug;
@@ -1858,7 +1853,6 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
.nr_pages = LONG_MAX,
.sync_mode = WB_SYNC_NONE,
.for_background = 1,
- .range_cyclic = 1,
.reason = WB_REASON_BACKGROUND,
};

@@ -1892,7 +1886,6 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
.nr_pages = nr_pages,
.sync_mode = WB_SYNC_NONE,
.for_kupdate = 1,
- .range_cyclic = 1,
.reason = WB_REASON_PERIODIC,
};

@@ -1984,7 +1977,7 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
return;

list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
- wb_start_writeback(wb, true, reason);
+ wb_start_writeback(wb, reason);
}

void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
@@ -2428,7 +2421,6 @@ void sync_inodes_sb(struct super_block *sb)
.sb = sb,
.sync_mode = WB_SYNC_ALL,
.nr_pages = LONG_MAX,
- .range_cyclic = 1,
.done = &done,
.reason = WB_REASON_SYNC,
.for_sync = 1,
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 68ed06962537..4e9d86dd7413 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -384,25 +384,16 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
pgoff_t end;
pgoff_t done_index;
int cycled;
- int range_whole = 0;
int tag;

pagevec_init(&pvec, 0);
- if (wbc->range_cyclic) {
- writeback_index = mapping->writeback_index; /* prev offset */
- index = writeback_index;
- if (index == 0)
- cycled = 1;
- else
- cycled = 0;
- end = -1;
- } else {
- index = wbc->range_start >> PAGE_SHIFT;
- end = wbc->range_end >> PAGE_SHIFT;
- if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
- range_whole = 1;
- cycled = 1; /* ignore range_cyclic tests */
- }
+ writeback_index = mapping->writeback_index; /* prev offset */
+ index = writeback_index;
+ if (index == 0)
+ cycled = 1;
+ else
+ cycled = 0;
+ end = -1;
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag = PAGECACHE_TAG_TOWRITE;
else
@@ -439,9 +430,7 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
goto retry;
}

- if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
- mapping->writeback_index = done_index;
-
+ mapping->writeback_index = done_index;
return ret;
}

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9c0091678af4..52319fbf300f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -87,7 +87,6 @@ struct writeback_control {
unsigned for_background:1; /* A background writeback */
unsigned tagged_writepages:1; /* tag-and-write to avoid livelock */
unsigned for_reclaim:1; /* Invoked from the page allocator */
- unsigned range_cyclic:1; /* range_start is cyclic */
unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *wb; /* wb this writeback is issued under */
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index dc1d0df91e0b..734349288fec 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -502,7 +502,7 @@ DECLARE_EVENT_CLASS(btrfs__writepage,
__entry->range_end = wbc->range_end;
__entry->for_kupdate = wbc->for_kupdate;
__entry->for_reclaim = wbc->for_reclaim;
- __entry->range_cyclic = wbc->range_cyclic;
+ __entry->range_cyclic = 1;
__entry->writeback_index = inode->i_mapping->writeback_index;
__entry->root_objectid =
BTRFS_I(inode)->root->root_key.objectid;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 9c3bc3883d2f..e1de610895f1 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -394,7 +394,7 @@ TRACE_EVENT(ext4_writepages,
__entry->writeback_index = inode->i_mapping->writeback_index;
__entry->sync_mode = wbc->sync_mode;
__entry->for_kupdate = wbc->for_kupdate;
- __entry->range_cyclic = wbc->range_cyclic;
+ __entry->range_cyclic = 1;
),

TP_printk("dev %d,%d ino %lu nr_to_write %ld pages_skipped %ld "
diff --git a/include/trace/events/f2fs.h b/include/trace/events/f2fs.h
index 5d216f7fb05a..e291e4717bf7 100644
--- a/include/trace/events/f2fs.h
+++ b/include/trace/events/f2fs.h
@@ -1174,7 +1174,7 @@ TRACE_EVENT(f2fs_writepages,
__entry->for_background = wbc->for_background;
__entry->tagged_writepages = wbc->tagged_writepages;
__entry->for_reclaim = wbc->for_reclaim;
- __entry->range_cyclic = wbc->range_cyclic;
+ __entry->range_cyclic = 1;
__entry->for_sync = wbc->for_sync;
),

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 9b57f014d79d..22e40e54943f 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -225,7 +225,7 @@ DECLARE_EVENT_CLASS(writeback_work_class,
__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
__entry->sync_mode = work->sync_mode;
__entry->for_kupdate = work->for_kupdate;
- __entry->range_cyclic = work->range_cyclic;
+ __entry->range_cyclic = 1;
__entry->for_background = work->for_background;
__entry->reason = work->reason;
__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
@@ -328,7 +328,7 @@ DECLARE_EVENT_CLASS(wbc_class,
__entry->for_kupdate = wbc->for_kupdate;
__entry->for_background = wbc->for_background;
__entry->for_reclaim = wbc->for_reclaim;
- __entry->range_cyclic = wbc->range_cyclic;
+ __entry->range_cyclic = 1;
__entry->range_start = (long)wbc->range_start;
__entry->range_end = (long)wbc->range_end;
__entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8d1fc593bce8..507c18eb33fa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2149,25 +2149,16 @@ int write_cache_pages(struct address_space *mapping,
pgoff_t end; /* Inclusive */
pgoff_t done_index;
int cycled;
- int range_whole = 0;
int tag;

pagevec_init(&pvec, 0);
- if (wbc->range_cyclic) {
- writeback_index = mapping->writeback_index; /* prev offset */
- index = writeback_index;
- if (index == 0)
- cycled = 1;
- else
- cycled = 0;
- end = -1;
- } else {
- index = wbc->range_start >> PAGE_SHIFT;
- end = wbc->range_end >> PAGE_SHIFT;
- if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
- range_whole = 1;
- cycled = 1; /* ignore range_cyclic tests */
- }
+ writeback_index = mapping->writeback_index; /* prev offset */
+ index = writeback_index;
+ if (index == 0)
+ cycled = 1;
+ else
+ cycled = 0;
+ end = -1;
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
tag = PAGECACHE_TAG_TOWRITE;
else
@@ -2285,9 +2276,7 @@ int write_cache_pages(struct address_space *mapping,
end = writeback_index - 1;
goto retry;
}
- if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
- mapping->writeback_index = done_index;
-
+ mapping->writeback_index = done_index;
return ret;
}
EXPORT_SYMBOL(write_cache_pages);
--
2.7.4

2017-09-27 20:14:36

by Jens Axboe

Subject: [PATCH 10/12] writeback: only allow one inflight and pending full flush

When someone calls wakeup_flusher_threads() or
wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
pages in the system (or on that bdi). If we are tight on memory, we
can get tons of these queued from kswapd/vmscan. This causes (at
least) two problems:

1) We consume a ton of memory just allocating writeback work items.
We've seen as much as 600 million of these writeback work items
pending. That's a lot of memory to pointlessly hold hostage,
while the box is under memory pressure.

2) We spend so much time processing these work items, that we
introduce a softlockup in writeback processing. This is because
each of the writeback work items don't end up doing any work (it's
hard when you have millions of identical ones coming in to the
flush machinery), so we just sit in a tight loop pulling work
items and deleting/freeing them.

Fix this by adding a 'start_all' bit to the writeback structure, and
set that when someone attempts to flush all dirty pages. The bit is
cleared when we start writeback on that work item. If the bit is
already set when we attempt to queue !nr_pages writeback, then we
simply ignore it.

This provides us one full flush in flight, with one pending as well,
and makes for more efficient handling of this type of writeback.
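
To illustrate the intended effect with a hypothetical reclaim-heavy call
sequence (not part of the patch, just the expected flow):

    /* kswapd/vmscan under pressure, hammering the flushers */
    wakeup_flusher_threads(WB_REASON_VMSCAN);  /* sets WB_start_all, queues one item per wb */
    wakeup_flusher_threads(WB_REASON_VMSCAN);  /* WB_start_all already set, returns early */
    wakeup_flusher_threads(WB_REASON_VMSCAN);  /* ditto, no allocation, nothing queued */

    /*
     * The flusher thread clears WB_start_all in get_next_work_item()
     * when it starts processing the work item, so at most one more
     * full flush can be queued while this one is in flight.
     */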

Acked-by: Johannes Weiner <[email protected]>
Tested-by: Chris Mason <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 25 +++++++++++++++++++++++++
include/linux/backing-dev-defs.h | 1 +
2 files changed, 26 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d9be3dc34302..2e46a1537e10 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,6 +53,7 @@ struct wb_writeback_work {
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned int auto_free:1; /* free on completion */
+ unsigned int start_all:1; /* nr_pages == 0 (all) writeback */
enum wb_reason reason; /* why was writeback initiated? */

struct list_head list; /* pending work list */
@@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
return;

/*
+ * All callers of this function want to start writeback of all
+ * dirty pages. Places like vmscan can call this at a very
+ * high frequency, causing pointless allocations of tons of
+ * work items and keeping the flusher threads busy retrieving
+ * that work. Ensure that we only allow one of them pending and
+ * inflight at the time. It doesn't matter if we race a little
+ * bit on this, so use the faster separate test/set bit variants.
+ */
+ if (test_bit(WB_start_all, &wb->state))
+ return;
+
+ set_bit(WB_start_all, &wb->state);
+
+ /*
* This is WB_SYNC_NONE writeback, so if allocation fails just
* wakeup the thread for old dirty data writeback
*/
work = kzalloc(sizeof(*work),
GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (!work) {
+ clear_bit(WB_start_all, &wb->state);
trace_writeback_nowork(wb);
wb_wakeup(wb);
return;
@@ -969,6 +985,7 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
work->range_cyclic = range_cyclic;
work->reason = reason;
work->auto_free = 1;
+ work->start_all = 1;

wb_queue_work(wb, work);
}
@@ -1822,6 +1839,14 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
list_del_init(&work->list);
}
spin_unlock_bh(&wb->work_lock);
+
+ /*
+ * Once we start processing a work item that had !nr_pages,
+ * clear the wb state bit for that so we can allow more.
+ */
+ if (work && work->start_all)
+ clear_bit(WB_start_all, &wb->state);
+
return work;
}

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..420de5c7c7f9 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -24,6 +24,7 @@ enum wb_state {
WB_shutting_down, /* wb_shutdown() in progress */
WB_writeback_running, /* Writeback is in progress */
WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
+ WB_start_all, /* nr_pages == 0 (all) work pending */
};

enum wb_congested_state {
--
2.7.4

2017-09-27 20:14:34

by Jens Axboe

Subject: [PATCH 11/12] writeback: make sync_inodes_sb() use range cyclic writeback

It's flushing all the dirty pages for this superblock, there's
no point in not doing range cyclic writeback.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2e46a1537e10..686d510e618c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2428,7 +2428,7 @@ void sync_inodes_sb(struct super_block *sb)
.sb = sb,
.sync_mode = WB_SYNC_ALL,
.nr_pages = LONG_MAX,
- .range_cyclic = 0,
+ .range_cyclic = 1,
.done = &done,
.reason = WB_REASON_SYNC,
.for_sync = 1,
--
2.7.4

2017-09-27 20:15:26

by Jens Axboe

Subject: [PATCH 09/12] writeback: move nr_pages == 0 logic to one location

Now that we have no external callers of wb_start_writeback(), we
can shuffle the passing in of 'nr_pages'. Everybody passes in 0
at this point, so just kill the argument and move the dirty
count retrieval to that function.

Acked-by: Johannes Weiner <[email protected]>
Tested-by: Chris Mason <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 42 ++++++++++++++++++------------------------
1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 571a5702b568..d9be3dc34302 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -933,8 +933,19 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,

#endif /* CONFIG_CGROUP_WRITEBACK */

-static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
- bool range_cyclic, enum wb_reason reason)
+/*
+ * Add in the number of potentially dirty inodes, because each inode
+ * write can dirty pagecache in the underlying blockdev.
+ */
+static unsigned long get_nr_dirty_pages(void)
+{
+ return global_node_page_state(NR_FILE_DIRTY) +
+ global_node_page_state(NR_UNSTABLE_NFS) +
+ get_nr_dirty_inodes();
+}
+
+static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
+ enum wb_reason reason)
{
struct wb_writeback_work *work;

@@ -954,7 +965,7 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
}

work->sync_mode = WB_SYNC_NONE;
- work->nr_pages = nr_pages;
+ work->nr_pages = wb_split_bdi_pages(wb, get_nr_dirty_pages());
work->range_cyclic = range_cyclic;
work->reason = reason;
work->auto_free = 1;
@@ -1814,17 +1825,6 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
return work;
}

-/*
- * Add in the number of potentially dirty inodes, because each inode
- * write can dirty pagecache in the underlying blockdev.
- */
-static unsigned long get_nr_dirty_pages(void)
-{
- return global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS) +
- get_nr_dirty_inodes();
-}
-
static long wb_check_background_flush(struct bdi_writeback *wb)
{
if (wb_over_bg_thresh(wb)) {
@@ -1951,7 +1951,7 @@ void wb_workfn(struct work_struct *work)
* write back the whole world.
*/
static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
- long nr_pages, enum wb_reason reason)
+ enum wb_reason reason)
{
struct bdi_writeback *wb;

@@ -1959,17 +1959,14 @@ static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
return;

list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
- wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
- true, reason);
+ wb_start_writeback(wb, true, reason);
}

void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
enum wb_reason reason)
{
- long nr_pages = get_nr_dirty_pages();
-
rcu_read_lock();
- __wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
+ __wakeup_flusher_threads_bdi(bdi, reason);
rcu_read_unlock();
}

@@ -1979,7 +1976,6 @@ void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
void wakeup_flusher_threads(enum wb_reason reason)
{
struct backing_dev_info *bdi;
- long nr_pages;

/*
* If we are expecting writeback progress we must submit plugged IO.
@@ -1987,11 +1983,9 @@ void wakeup_flusher_threads(enum wb_reason reason)
if (blk_needs_flush_plug(current))
blk_schedule_flush_plug(current);

- nr_pages = get_nr_dirty_pages();
-
rcu_read_lock();
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
- __wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
+ __wakeup_flusher_threads_bdi(bdi, reason);
rcu_read_unlock();
}

--
2.7.4

2017-09-27 20:14:20

by Jens Axboe

Subject: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases

We currently it it for find_or_create_page(), which means that it
cannot fail. Ensure we also pass in 'retry == true' to
alloc_page_buffers(), which also ensure that it cannot fail.

After this, there are no failure cases in grow_dev_page() that
occur because of a failed memory allocation.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/buffer.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1234ae343aef..3b60cd8456db 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -988,8 +988,6 @@ grow_dev_page(struct block_device *bdev, sector_t block,
gfp_mask |= __GFP_NOFAIL;

page = find_or_create_page(inode->i_mapping, index, gfp_mask);
- if (!page)
- return ret;

BUG_ON(!PageLocked(page));

@@ -1008,9 +1006,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
/*
* Allocate some buffers for this page
*/
- bh = alloc_page_buffers(page, size, false);
- if (!bh)
- goto failed;
+ bh = alloc_page_buffers(page, size, true);

/*
* Link the page to the buffers and initialise them. Take the
--
2.7.4

2017-09-27 20:15:43

by Jens Axboe

Subject: [PATCH 06/12] writeback: provide a wakeup_flusher_threads_bdi()

Similar to wakeup_flusher_threads(), except that we only wake
up the flusher threads on the specified backing device.

No functional changes in this patch.

Acked-by: Johannes Weiner <[email protected]>
Tested-by: Chris Mason <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 39 +++++++++++++++++++++++++++++----------
include/linux/writeback.h | 2 ++
2 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 65e6992d8719..1a9b45250437 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1947,6 +1947,33 @@ void wb_workfn(struct work_struct *work)
}

/*
+ * Start writeback of `nr_pages' pages on this bdi. If `nr_pages' is zero,
+ * write back the whole world.
+ */
+static void __wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
+ long nr_pages, enum wb_reason reason)
+{
+ struct bdi_writeback *wb;
+
+ if (!bdi_has_dirty_io(bdi))
+ return;
+
+ list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
+ wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
+ true, reason);
+}
+
+void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
+ enum wb_reason reason)
+{
+ long nr_pages = get_nr_dirty_pages();
+
+ rcu_read_lock();
+ __wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
+ rcu_read_unlock();
+}
+
+/*
* Wakeup the flusher threads to start writeback of all currently dirty pages
*/
void wakeup_flusher_threads(enum wb_reason reason)
@@ -1963,16 +1990,8 @@ void wakeup_flusher_threads(enum wb_reason reason)
nr_pages = get_nr_dirty_pages();

rcu_read_lock();
- list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
- struct bdi_writeback *wb;
-
- if (!bdi_has_dirty_io(bdi))
- continue;
-
- list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
- wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
- true, reason);
- }
+ list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
+ __wakeup_flusher_threads_bdi(bdi, nr_pages, reason);
rcu_read_unlock();
}

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 1f9c6db5e29a..9c0091678af4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -190,6 +190,8 @@ bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
void sync_inodes_sb(struct super_block *);
void wakeup_flusher_threads(enum wb_reason reason);
+void wakeup_flusher_threads_bdi(struct backing_dev_info *bdi,
+ enum wb_reason reason);
void inode_wait_for_writeback(struct inode *inode);

/* writeback.h requires fs.h; it, too, is not included from here. */
--
2.7.4

2017-09-27 20:15:41

by Jens Axboe

Subject: [PATCH 08/12] writeback: make wb_start_writeback() static

We don't have any callers outside of fs-writeback.c anymore,
make it private.

Acked-by: Johannes Weiner <[email protected]>
Tested-by: Chris Mason <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 4 ++--
include/linux/backing-dev.h | 2 --
2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1a9b45250437..571a5702b568 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -933,8 +933,8 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,

#endif /* CONFIG_CGROUP_WRITEBACK */

-void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
- bool range_cyclic, enum wb_reason reason)
+static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+ bool range_cyclic, enum wb_reason reason)
{
struct wb_writeback_work *work;

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 854e1bdd0b2a..157e950a70dc 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -38,8 +38,6 @@ static inline struct backing_dev_info *bdi_alloc(gfp_t gfp_mask)
return bdi_alloc_node(gfp_mask, NUMA_NO_NODE);
}

-void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
- bool range_cyclic, enum wb_reason reason);
void wb_start_background_writeback(struct bdi_writeback *wb);
void wb_workfn(struct work_struct *work);
void wb_wakeup_delayed(struct bdi_writeback *wb);
--
2.7.4

2017-09-27 20:16:20

by Jens Axboe

Subject: [PATCH 05/12] writeback: switch wakeup_flusher_threads() to cyclic writeback

We're writing back the full range of dirty pages on the devices, so
there's no point in making this special and not doing normal range
cyclic writeback.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bb6148dc6d24..65e6992d8719 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1971,7 +1971,7 @@ void wakeup_flusher_threads(enum wb_reason reason)

list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
- false, reason);
+ true, reason);
}
rcu_read_unlock();
}
--
2.7.4

2017-09-27 20:17:00

by Jens Axboe

Subject: [PATCH 04/12] fs: kill 'nr_pages' argument from wakeup_flusher_threads()

Everybody is passing in 0 now, let's get rid of the argument.

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 9 ++++-----
fs/sync.c | 2 +-
include/linux/writeback.h | 2 +-
mm/vmscan.c | 2 +-
4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 245c430a2e41..bb6148dc6d24 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1947,12 +1947,12 @@ void wb_workfn(struct work_struct *work)
}

/*
- * Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
- * the whole world.
+ * Wakeup the flusher threads to start writeback of all currently dirty pages
*/
-void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
+void wakeup_flusher_threads(enum wb_reason reason)
{
struct backing_dev_info *bdi;
+ long nr_pages;

/*
* If we are expecting writeback progress we must submit plugged IO.
@@ -1960,8 +1960,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
if (blk_needs_flush_plug(current))
blk_schedule_flush_plug(current);

- if (!nr_pages)
- nr_pages = get_nr_dirty_pages();
+ nr_pages = get_nr_dirty_pages();

rcu_read_lock();
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
diff --git a/fs/sync.c b/fs/sync.c
index a576aa2e6b09..09f96a18dd93 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -108,7 +108,7 @@ SYSCALL_DEFINE0(sync)
{
int nowait = 0, wait = 1;

- wakeup_flusher_threads(0, WB_REASON_SYNC);
+ wakeup_flusher_threads(WB_REASON_SYNC);
iterate_supers(sync_inodes_one_sb, NULL);
iterate_supers(sync_fs_one_sb, &nowait);
iterate_supers(sync_fs_one_sb, &wait);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d5815794416c..1f9c6db5e29a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -189,7 +189,7 @@ bool try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
void sync_inodes_sb(struct super_block *);
-void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+void wakeup_flusher_threads(enum wb_reason reason);
void inode_wait_for_writeback(struct inode *inode);

/* writeback.h requires fs.h; it, too, is not included from here. */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13d711dd8776..42a7fdd52d87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1867,7 +1867,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
* also allow kswapd to start writing pages during reclaim.
*/
if (stat.nr_unqueued_dirty == nr_taken) {
- wakeup_flusher_threads(0, WB_REASON_VMSCAN);
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
set_bit(PGDAT_DIRTY, &pgdat->flags);
}

--
2.7.4

2017-09-27 20:17:16

by Jens Axboe

Subject: [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL

Instead of adding weird retry logic in that function, utilize
__GFP_NOFAIL to ensure that the vm takes care of handling any
potential retries appropriately. This means we don't have to
call free_more_memory() from here.
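
The core of it is just the allocation flag; a minimal excerpt of the
pattern (the full conversion of the callers is in the diff below):

    gfp_t gfp = GFP_NOFS;

    if (retry)
        gfp |= __GFP_NOFAIL;    /* the VM handles retrying, no free_more_memory() */

    bh = alloc_buffer_head(gfp);
    /* with __GFP_NOFAIL set, bh cannot be NULL here */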

Signed-off-by: Jens Axboe <[email protected]>
---
drivers/md/bitmap.c | 2 +-
fs/buffer.c | 33 ++++++++++-----------------------
fs/ntfs/aops.c | 2 +-
fs/ntfs/mft.c | 2 +-
include/linux/buffer_head.h | 2 +-
5 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index d2121637b4ab..4d8ed74efadf 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -368,7 +368,7 @@ static int read_page(struct file *file, unsigned long index,
pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
(unsigned long long)index << PAGE_SHIFT);

- bh = alloc_page_buffers(page, 1<<inode->i_blkbits, 0);
+ bh = alloc_page_buffers(page, 1<<inode->i_blkbits, false);
if (!bh) {
ret = -ENOMEM;
goto out;
diff --git a/fs/buffer.c b/fs/buffer.c
index 170df856bdb9..1234ae343aef 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -861,16 +861,19 @@ int remove_inode_buffers(struct inode *inode)
* which may not fail from ordinary buffer allocations.
*/
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
- int retry)
+ bool retry)
{
struct buffer_head *bh, *head;
+ gfp_t gfp = GFP_NOFS;
long offset;

-try_again:
+ if (retry)
+ gfp |= __GFP_NOFAIL;
+
head = NULL;
offset = PAGE_SIZE;
while ((offset -= size) >= 0) {
- bh = alloc_buffer_head(GFP_NOFS);
+ bh = alloc_buffer_head(gfp);
if (!bh)
goto no_grow;

@@ -896,23 +899,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
} while (head);
}

- /*
- * Return failure for non-async IO requests. Async IO requests
- * are not allowed to fail, so we have to wait until buffer heads
- * become available. But we don't want tasks sleeping with
- * partially complete buffers, so all were released above.
- */
- if (!retry)
- return NULL;
-
- /* We're _really_ low on memory. Now we just
- * wait for old buffer heads to become free due to
- * finishing IO. Since this is an async request and
- * the reserve list is empty, we're sure there are
- * async buffer heads in use.
- */
- free_more_memory();
- goto try_again;
+ return NULL;
}
EXPORT_SYMBOL_GPL(alloc_page_buffers);

@@ -1021,7 +1008,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
/*
* Allocate some buffers for this page
*/
- bh = alloc_page_buffers(page, size, 0);
+ bh = alloc_page_buffers(page, size, false);
if (!bh)
goto failed;

@@ -1575,7 +1562,7 @@ void create_empty_buffers(struct page *page,
{
struct buffer_head *bh, *head, *tail;

- head = alloc_page_buffers(page, blocksize, 1);
+ head = alloc_page_buffers(page, blocksize, true);
bh = head;
do {
bh->b_state |= b_state;
@@ -2638,7 +2625,7 @@ int nobh_write_begin(struct address_space *mapping,
* Be careful: the buffer linked list is a NULL terminated one, rather
* than the circular one we're used to.
*/
- head = alloc_page_buffers(page, blocksize, 0);
+ head = alloc_page_buffers(page, blocksize, false);
if (!head) {
ret = -ENOMEM;
goto out_release;
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index cc91856b5e2d..3a2e509c77c5 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -1739,7 +1739,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) {
spin_lock(&mapping->private_lock);
if (unlikely(!page_has_buffers(page))) {
spin_unlock(&mapping->private_lock);
- bh = head = alloc_page_buffers(page, bh_size, 1);
+ bh = head = alloc_page_buffers(page, bh_size, true);
spin_lock(&mapping->private_lock);
if (likely(!page_has_buffers(page))) {
struct buffer_head *tail;
diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
index b6f402194f02..ee8392aee9f6 100644
--- a/fs/ntfs/mft.c
+++ b/fs/ntfs/mft.c
@@ -507,7 +507,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no,
if (unlikely(!page_has_buffers(page))) {
struct buffer_head *tail;

- bh = head = alloc_page_buffers(page, blocksize, 1);
+ bh = head = alloc_page_buffers(page, blocksize, true);
do {
set_buffer_uptodate(bh);
tail = bh;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index c8dae555eccf..ae2d25f01b98 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -156,7 +156,7 @@ void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset);
int try_to_free_buffers(struct page *);
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
- int retry);
+ bool retry);
void create_empty_buffers(struct page *, unsigned long,
unsigned long b_state);
void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
--
2.7.4

2017-09-27 20:23:03

by Jens Axboe

Subject: Re: [PATCH 12/12] writeback: kill off ->range_cyclic option

On 09/27/2017 10:13 PM, Jens Axboe wrote:
> All types of writeback are now range cyclic. Kill off the member
> from struct wb_writeback_work and struct writeback_control.
>
> Remove the various checks for whether or not this is range cyclic
> writeback in the writeback code and file systems.
>
> For tracing, we leave the member there, and just always set it to 1.

Disregard this patch, there are obviously places where we don't set
range_cyclic at all and write out a subset of the dirty pages (or
just one). Needs more looking into, for now just ignore this last
patch in the series.

--
Jens Axboe

2017-09-28 13:19:53

by John Stoffel

Subject: Re: [PATCH 0/12 v3] Writeback improvements

On Wed, Sep 27, 2017 at 02:13:47PM -0600, Jens Axboe wrote:
> We've had some issues with writeback in presence of memory reclaim
> at Facebook, and this patch set attempts to fix it up. The real
> functional change for that issue is patch 10. The rest are cleanups,
> as well as the removal of non-range-cyclic writeback. The users
> of that were sync_inodes_sb() and wakeup_flusher_threads(), both of
> which write back all of the dirty pages.

So does this patch set make things faster? Less bursty? Does it make
writeout take longer, but with less spikes? What is the performance
impact of this change? I hate to be a pain, but this just smacks of
arm waving and I'm sure FB doesn't make changes without data... :-)

> The basic idea is that we have callers that call
> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
> everything'. For memory reclaim situations, we can end up queuing
> a TON of these kinds of writeback units. This can cause softlockups
> and further memory issues, since we allocate huge amounts of
> struct wb_writeback_work to handle this writeback. Handle this
> situation more gracefully.

Do you push back on the callers or slow them down? Why do we even
allow callers to flush everything?

John

2017-09-28 13:39:59

by Jens Axboe

Subject: Re: [PATCH 0/12 v3] Writeback improvements

On 09/28/2017 03:19 PM, John Stoffel wrote:
> On Wed, Sep 27, 2017 at 02:13:47PM -0600, Jens Axboe wrote:
>> We've had some issues with writeback in presence of memory reclaim
>> at Facebook, and this patch set attempts to fix it up. The real
>> functional change for that issue is patch 10. The rest are cleanups,
>> as well as the removal of non-range-cyclic writeback. The users
>> of that were sync_inodes_sb() and wakeup_flusher_threads(), both of
>> which write back all of the dirty pages.
>
> So does this patch set make things faster? Less bursty? Does it make
> writeout take longer, but with less spikes? What is the performance
> impact of this change? I hate to be a pain, but this just smacks of
> arm waving and I'm sure FB doesn't make changes without data... :-)

See patch 10, this isn't arm waving at all. The whole point is that
you can have millions of writeback work items, which don't do anything.
Not only are we wasting a full core doing nothing, it's bad
enough that we can trigger softlockups, since it's just sitting there
in a loop doing that.

It's all explained in that patch...

>> The basic idea is that we have callers that call
>> wakeup_flusher_threads() with nr_pages == 0. This means 'writeback
>> everything'. For memory reclaim situations, we can end up queuing
>> a TON of these kinds of writeback units. This can cause softlockups
>> and further memory issues, since we allocate huge amounts of
>> struct wb_writeback_work to handle this writeback. Handle this
>> situation more gracefully.
>
> Do you push back on the callers or slow them down? Why do we even
> allow callers to flush everything?

Ehm, because we have to? There are cases where flushing everything
makes sense. Laptop mode is one of them; the problematic case
here is memory reclaim. To clean dirty pages, you have to kick
the flusher threads.

--
Jens Axboe

2017-09-28 15:26:33

by Nikolay Borisov

Subject: Re: [PATCH 01/12] buffer: have alloc_page_buffers() use __GFP_NOFAIL



On 27.09.2017 23:13, Jens Axboe wrote:
> Instead of adding weird retry logic in that function, utilize
> __GFP_NOFAIL to ensure that the vm takes care of handling any
> potential retries appropriately. This means we don't have to
> call free_more_memory() from here.
>
> Signed-off-by: Jens Axboe <[email protected]>

Reviewed-by: Nikolay Borisov <[email protected]>
> ---
> drivers/md/bitmap.c | 2 +-
> fs/buffer.c | 33 ++++++++++-----------------------
> fs/ntfs/aops.c | 2 +-
> fs/ntfs/mft.c | 2 +-
> include/linux/buffer_head.h | 2 +-
> 5 files changed, 14 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
> index d2121637b4ab..4d8ed74efadf 100644
> --- a/drivers/md/bitmap.c
> +++ b/drivers/md/bitmap.c
> @@ -368,7 +368,7 @@ static int read_page(struct file *file, unsigned long index,
> pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE,
> (unsigned long long)index << PAGE_SHIFT);
>
> - bh = alloc_page_buffers(page, 1<<inode->i_blkbits, 0);
> + bh = alloc_page_buffers(page, 1<<inode->i_blkbits, false);
> if (!bh) {
> ret = -ENOMEM;
> goto out;
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 170df856bdb9..1234ae343aef 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -861,16 +861,19 @@ int remove_inode_buffers(struct inode *inode)
> * which may not fail from ordinary buffer allocations.
> */
> struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> - int retry)
> + bool retry)
> {
> struct buffer_head *bh, *head;
> + gfp_t gfp = GFP_NOFS;
> long offset;
>
> -try_again:
> + if (retry)
> + gfp |= __GFP_NOFAIL;
> +
> head = NULL;
> offset = PAGE_SIZE;
> while ((offset -= size) >= 0) {
> - bh = alloc_buffer_head(GFP_NOFS);
> + bh = alloc_buffer_head(gfp);
> if (!bh)
> goto no_grow;
>
> @@ -896,23 +899,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> } while (head);
> }
>
> - /*
> - * Return failure for non-async IO requests. Async IO requests
> - * are not allowed to fail, so we have to wait until buffer heads
> - * become available. But we don't want tasks sleeping with
> - * partially complete buffers, so all were released above.
> - */
> - if (!retry)
> - return NULL;
> -
> - /* We're _really_ low on memory. Now we just
> - * wait for old buffer heads to become free due to
> - * finishing IO. Since this is an async request and
> - * the reserve list is empty, we're sure there are
> - * async buffer heads in use.
> - */
> - free_more_memory();
> - goto try_again;
> + return NULL;
> }
> EXPORT_SYMBOL_GPL(alloc_page_buffers);
>
> @@ -1021,7 +1008,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
> /*
> * Allocate some buffers for this page
> */
> - bh = alloc_page_buffers(page, size, 0);
> + bh = alloc_page_buffers(page, size, false);
> if (!bh)
> goto failed;
>
> @@ -1575,7 +1562,7 @@ void create_empty_buffers(struct page *page,
> {
> struct buffer_head *bh, *head, *tail;
>
> - head = alloc_page_buffers(page, blocksize, 1);
> + head = alloc_page_buffers(page, blocksize, true);
> bh = head;
> do {
> bh->b_state |= b_state;
> @@ -2638,7 +2625,7 @@ int nobh_write_begin(struct address_space *mapping,
> * Be careful: the buffer linked list is a NULL terminated one, rather
> * than the circular one we're used to.
> */
> - head = alloc_page_buffers(page, blocksize, 0);
> + head = alloc_page_buffers(page, blocksize, false);
> if (!head) {
> ret = -ENOMEM;
> goto out_release;
> diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
> index cc91856b5e2d..3a2e509c77c5 100644
> --- a/fs/ntfs/aops.c
> +++ b/fs/ntfs/aops.c
> @@ -1739,7 +1739,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) {
> spin_lock(&mapping->private_lock);
> if (unlikely(!page_has_buffers(page))) {
> spin_unlock(&mapping->private_lock);
> - bh = head = alloc_page_buffers(page, bh_size, 1);
> + bh = head = alloc_page_buffers(page, bh_size, true);
> spin_lock(&mapping->private_lock);
> if (likely(!page_has_buffers(page))) {
> struct buffer_head *tail;
> diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c
> index b6f402194f02..ee8392aee9f6 100644
> --- a/fs/ntfs/mft.c
> +++ b/fs/ntfs/mft.c
> @@ -507,7 +507,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no,
> if (unlikely(!page_has_buffers(page))) {
> struct buffer_head *tail;
>
> - bh = head = alloc_page_buffers(page, blocksize, 1);
> + bh = head = alloc_page_buffers(page, blocksize, true);
> do {
> set_buffer_uptodate(bh);
> tail = bh;
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index c8dae555eccf..ae2d25f01b98 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -156,7 +156,7 @@ void set_bh_page(struct buffer_head *bh,
> struct page *page, unsigned long offset);
> int try_to_free_buffers(struct page *);
> struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> - int retry);
> + bool retry);
> void create_empty_buffers(struct page *, unsigned long,
> unsigned long b_state);
> void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
>
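
For illustration, a minimal caller-side sketch of the reworked helper, assuming
kernel context; demo_attach_buffers() is a hypothetical name and not part of the
patch. With retry == true the allocation uses __GFP_NOFAIL and cannot return
NULL, so only the retry == false mode needs a failure check:

static int demo_attach_buffers(struct page *page, unsigned long blocksize,
			       bool retry)
{
	struct buffer_head *head;

	/* retry == true maps to __GFP_NOFAIL inside alloc_page_buffers() */
	head = alloc_page_buffers(page, blocksize, retry);
	if (!retry && !head)
		return -ENOMEM;	/* only the fallible mode can get here */

	/* ... link the buffers to the page, as the real callers do ... */
	return 0;
}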

2017-09-28 15:26:32

by Nikolay Borisov

[permalink] [raw]
Subject: Re: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases



On 27.09.2017 23:13, Jens Axboe wrote:
> We currently it it for find_or_create_page(), which means that it

nit: Perhaps you wanted to write "We currently use it for find_..."

otherwise:

Reviewed-by: Nikolay Borisov <[email protected]>

> cannot fail. Ensure we also pass in 'retry == true' to
> alloc_page_buffers(), which also ensures that it cannot fail.
>
> After this, there are no failure cases in grow_dev_page() that
> occur because of a failed memory allocation.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> fs/buffer.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1234ae343aef..3b60cd8456db 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -988,8 +988,6 @@ grow_dev_page(struct block_device *bdev, sector_t block,
> gfp_mask |= __GFP_NOFAIL;
>
> page = find_or_create_page(inode->i_mapping, index, gfp_mask);
> - if (!page)
> - return ret;
>
> BUG_ON(!PageLocked(page));
>
> @@ -1008,9 +1006,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
> /*
> * Allocate some buffers for this page
> */
> - bh = alloc_page_buffers(page, size, false);
> - if (!bh)
> - goto failed;
> + bh = alloc_page_buffers(page, size, true);
>
> /*
> * Link the page to the buffers and initialise them. Take the
>

2017-09-28 15:31:30

by Nikolay Borisov

[permalink] [raw]
Subject: Re: [PATCH 03/12] buffer: eliminate the need to call free_more_memory() in __getblk_slow()



On 27.09.2017 23:13, Jens Axboe wrote:
> Since the previous commit removed any case where grow_buffers()
> would return failure due to memory allocations, we can safely
> remove the case where we have to call free_more_memory() in
> this function.
>
> Since this is also the last user of free_more_memory(), kill
> it off completely.
>
> Signed-off-by: Jens Axboe <[email protected]>

Reviewed-by: Nikolay Borisov <[email protected]>

> ---
> fs/buffer.c | 23 -----------------------
> 1 file changed, 23 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 3b60cd8456db..bff571dc7bc3 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -253,27 +253,6 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
> }
>
> /*
> - * Kick the writeback threads then try to free up some ZONE_NORMAL memory.
> - */
> -static void free_more_memory(void)
> -{
> - struct zoneref *z;
> - int nid;
> -
> - wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
> - yield();
> -
> - for_each_online_node(nid) {
> -
> - z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
> - gfp_zone(GFP_NOFS), NULL);
> - if (z->zone)
> - try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
> - GFP_NOFS, NULL);
> - }
> -}
> -
> -/*
> * I/O completion handler for block_read_full_page() - pages
> * which come unlocked at the end of I/O.
> */
> @@ -1086,8 +1065,6 @@ __getblk_slow(struct block_device *bdev, sector_t block,
> ret = grow_buffers(bdev, block, size, gfp);
> if (ret < 0)
> return NULL;
> - if (ret == 0)
> - free_more_memory();
> }
> }
>
>

2017-09-28 18:12:37

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 02/12] buffer: grow_dev_page() should use __GFP_NOFAIL for all cases

On 09/28/2017 04:11 PM, Nikolay Borisov wrote:
>
>
> On 27.09.2017 23:13, Jens Axboe wrote:
>> We currently it it for find_or_create_page(), which means that it
>
> nit: Perhaps you wanted to write "We currently use it for find_..."
>
> otherwise:
>
> Reviewed-by: Nikolay Borisov <[email protected]>

Yeah, fixed up, thanks.

--
Jens Axboe

2017-09-28 21:41:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush

On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe <[email protected]> wrote:

> When someone calls wakeup_flusher_threads() or
> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
> pages in the system (or on that bdi). If we are tight on memory, we
> can get tons of these queued from kswapd/vmscan. This causes (at
> least) two problems:
>
> 1) We consume a ton of memory just allocating writeback work items.
> We've seen as much as 600 million of these writeback work items
> pending. That's a lot of memory to pointlessly hold hostage,
> while the box is under memory pressure.
>
> 2) We spend so much time processing these work items, that we
> introduce a softlockup in writeback processing. This is because
> each of the writeback work items doesn't end up doing any work (it's
> hard when you have millions of identical ones coming in to the
> flush machinery), so we just sit in a tight loop pulling work
> items and deleting/freeing them.
>
> Fix this by adding a 'start_all' bit to the writeback structure, and
> set that when someone attempts to flush all dirty pages. The bit is
> cleared when we start writeback on that work item. If the bit is
> already set when we attempt to queue !nr_pages writeback, then we
> simply ignore it.
>
> This provides us one full flush in flight, with one pending as well,
> and makes for more efficient handling of this type of writeback.
>
> ...
>
> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
> return;
>
> /*
> + * All callers of this function want to start writeback of all
> + * dirty pages. Places like vmscan can call this at a very
> + * high frequency, causing pointless allocations of tons of
> + * work items and keeping the flusher threads busy retrieving
> + * that work. Ensure that we only allow one of them pending and
> + * inflight at the time. It doesn't matter if we race a little
> + * bit on this, so use the faster separate test/set bit variants.
> + */
> + if (test_bit(WB_start_all, &wb->state))
> + return;
> +
> + set_bit(WB_start_all, &wb->state);

test_and_set_bit()?


2017-09-28 21:44:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush

On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
<[email protected]> wrote:
>
> test_and_set_bit()?

If there aren't any atomicity concerns (either because of higher-level
locking, or because racing and having two people set the bit is fine),
it can be better to do them separately if the test_bit() is the common
case and you can avoid dirtying a cacheline that way.

But yeah, if that is the case, it might be worth documenting, because
test_and_set_bit() is the more obviously appropriate "there can be
only one" model.

Linus
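
For illustration, a minimal kernel-style sketch of the pattern described above,
assuming <linux/bitops.h>; demo_state, DEMO_PENDING and demo_try_queue() are
hypothetical names, not from the series. The plain test_bit() keeps the common
"already set" path read-only, so the cacheline is only dirtied when the bit
actually needs setting, at the cost of letting two racing callers both set it:

static unsigned long demo_state;

#define DEMO_PENDING	0	/* illustrative bit number */

static bool demo_try_queue(void)
{
	/* Common case: bit already set, return without a locked RMW. */
	if (test_bit(DEMO_PENDING, &demo_state))
		return false;

	/*
	 * Racy by design: two callers can both pass the test above and both
	 * set the bit. That is harmless for this use case, which is what
	 * makes the cheaper separate test/set acceptable.
	 */
	set_bit(DEMO_PENDING, &demo_state);
	return true;
}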

2017-09-29 00:16:04

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush

On 09/28/2017 11:41 PM, Andrew Morton wrote:
> On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe <[email protected]> wrote:
>
>> When someone calls wakeup_flusher_threads() or
>> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
>> pages in the system (or on that bdi). If we are tight on memory, we
>> can get tons of these queued from kswapd/vmscan. This causes (at
>> least) two problems:
>>
>> 1) We consume a ton of memory just allocating writeback work items.
>> We've seen as much as 600 million of these writeback work items
>> pending. That's a lot of memory to pointlessly hold hostage,
>> while the box is under memory pressure.
>>
>> 2) We spend so much time processing these work items, that we
>> introduce a softlockup in writeback processing. This is because
>> each of the writeback work items doesn't end up doing any work (it's
>> hard when you have millions of identical ones coming in to the
>> flush machinery), so we just sit in a tight loop pulling work
>> items and deleting/freeing them.
>>
>> Fix this by adding a 'start_all' bit to the writeback structure, and
>> set that when someone attempts to flush all dirty pages. The bit is
>> cleared when we start writeback on that work item. If the bit is
>> already set when we attempt to queue !nr_pages writeback, then we
>> simply ignore it.
>>
>> This provides us one full flush in flight, with one pending as well,
>> and makes for more efficient handling of this type of writeback.
>>
>> ...
>>
>> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
>> return;
>>
>> /*
>> + * All callers of this function want to start writeback of all
>> + * dirty pages. Places like vmscan can call this at a very
>> + * high frequency, causing pointless allocations of tons of
>> + * work items and keeping the flusher threads busy retrieving
>> + * that work. Ensure that we only allow one of them pending and
>> + * inflight at the time. It doesn't matter if we race a little
>> + * bit on this, so use the faster separate test/set bit variants.
>> + */
>> + if (test_bit(WB_start_all, &wb->state))
>> + return;
>> +
>> + set_bit(WB_start_all, &wb->state);
>
> test_and_set_bit()?

Like Linus says, this is done purposely. I've even included a bit about
it in the comment above, though maybe it's not clear enough. I've used
this trick in blk-mq quite a bit as well, and for high frequency calls,
it can make a substantial difference not to redirty that cache line if
you can avoid it.

If you do care about atomicity, this works really well too:

if (test_bit(bit, addr) || test_and_set_bit(bit, addr))
...

just to avoid the locked operation. Also see this commit:
commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d
Author: Jens Axboe <[email protected]>
Date: Thu May 22 11:54:16 2014 -0700

mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT

where there are some actual numbers on a specific case.

For the case at hand, we don't even need to do the test_and_set
case, since we don't care about a small race there.

--
Jens Axboe
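
For illustration, a sketch of the test_bit() || test_and_set_bit() guard
mentioned above, for cases that do need a single winner; demo_claim() is a
hypothetical name. The unlocked test_bit() filters the common case, and
test_and_set_bit() guarantees that exactly one caller observes the 0 -> 1
transition:

static bool demo_claim(int bit, unsigned long *addr)
{
	if (test_bit(bit, addr) || test_and_set_bit(bit, addr))
		return false;	/* already claimed by someone else */

	return true;		/* we made the 0 -> 1 transition */
}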

2017-09-29 00:17:38

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush

On 09/28/2017 11:44 PM, Linus Torvalds wrote:
> On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
> <[email protected]> wrote:
>>
>> test_and_set_bit()?
>
> If there aren't any atomicity concerns (either because of higher-level
> locking, or because racing and having two people set the bit is fine),
> it can be better to do them separately if the test_bit() is the common
> case and you can avoid dirtying a cacheline that way.
>
> But yeah, if that is the case, it might be worth documenting, because
> test_and_set_bit() is the more obviously appropriate "there can be
> only one" model.

It is documented though, but maybe not well enough...

I've actually had to document/explain it enough times now that it
might be worth making a general construct. Though it has to be
used carefully, so perhaps it's better contained as separate use
cases.

--
Jens Axboe

2017-09-29 05:21:35

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush

On Fri, Sep 29, 2017 at 3:17 AM, Jens Axboe <[email protected]> wrote:
> On 09/28/2017 11:44 PM, Linus Torvalds wrote:
>> On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
>> <[email protected]> wrote:
>>>
>>> test_and_set_bit()?
>>
>> If there aren't any atomicity concerns (either because of higher-level
>> locking, or because racing and having two people set the bit is fine),
>> it can be better to do them separately if the test_bit() is the common
>> case and you can avoid dirtying a cacheline that way.
>>
>> But yeah, if that is the case, it might be worth documenting, because
>> test_and_set_bit() is the more obviously appropriate "there can be
>> only one" model.
>
> It is documented though, but maybe not well enough...
>
> I've actually had to document/explain it enough times now, that it
> might be worth making a general construct. Though it has to be
> used carefully, so perhaps it's better contained as separate use
> cases.
>

Maybe change "Ensure that we only allow one of them pending"
in the comment above. Only the "allow one inflight" part is correct.

Or apply your follow-up patch and be done with it...

Amir.
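
For context, a rough sketch of the set/clear lifecycle being discussed;
demo_queue_full_flush() and demo_start_full_flush() are hypothetical names and
not the exact functions from the series. The bit is set when a full flush is
queued and cleared when the flusher starts processing that work item, which is
what allows one flush in flight plus at most one more pending:

/* Reuses the DEMO_PENDING bit from the earlier sketch. */
static void demo_queue_full_flush(unsigned long *state)
{
	if (test_bit(DEMO_PENDING, state))
		return;			/* one already pending or in flight */
	set_bit(DEMO_PENDING, state);
	/* ... allocate and queue a single full-flush work item here ... */
}

static void demo_start_full_flush(unsigned long *state)
{
	clear_bit(DEMO_PENDING, state);	/* lets the next full flush be queued */
	/* ... write back the dirty pages described by the work item ... */
}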