2016-03-30 15:08:12

by Jens Axboe

Subject: [PATCHSET v3][RFC] Make background writeback not suck

Hi,

This patchset isn't so much a final solution as it is a demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server-oriented
workloads, installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

Last time I posted this, I used flash storage as the example. But
this works equally well on rotating storage. Let's run a test case
that writes a lot. This test writes 50 files, each 100M, on XFS on
a regular hard drive. While this happens, we attempt to read
another file with fio.

Writers:

$ time (./write-files ; sync)
real 1m6.304s
user 0m0.020s
sys 0m12.210s

Fio reader:

read : io=35580KB, bw=550868B/s, iops=134, runt= 66139msec
clat (usec): min=40, max=654204, avg=7432.37, stdev=43872.83
lat (usec): min=40, max=654204, avg=7432.70, stdev=43872.83
clat percentiles (usec):
| 1.00th=[ 41], 5.00th=[ 41], 10.00th=[ 41], 20.00th=[ 42],
| 30.00th=[ 42], 40.00th=[ 42], 50.00th=[ 43], 60.00th=[ 52],
| 70.00th=[ 59], 80.00th=[ 65], 90.00th=[ 87], 95.00th=[ 1192],
| 99.00th=[254976], 99.50th=[358400], 99.90th=[444416], 99.95th=[468992],
| 99.99th=[651264]


Let's run the same test, but with the patches applied, and wb_percent
set to 10%:

Writers:

$ time (./write-files ; sync)
real 1m29.384s
user 0m0.040s
sys 0m10.810s

Fio reader:

read : io=1024.0MB, bw=18640KB/s, iops=4660, runt= 56254msec
clat (usec): min=39, max=408400, avg=212.05, stdev=2982.44
lat (usec): min=39, max=408400, avg=212.30, stdev=2982.44
clat percentiles (usec):
| 1.00th=[ 40], 5.00th=[ 41], 10.00th=[ 41], 20.00th=[ 41],
| 30.00th=[ 42], 40.00th=[ 42], 50.00th=[ 42], 60.00th=[ 42],
| 70.00th=[ 43], 80.00th=[ 45], 90.00th=[ 56], 95.00th=[ 60],
| 99.00th=[ 454], 99.50th=[ 8768], 99.90th=[36608], 99.95th=[43264],
| 99.99th=[69120]


Much better, looking at the P99.x percentiles, and of course on
the bandwidth front as well. It's the difference between this:

---io---- -system-- ------cpu-----
bi bo in cs us sy id wa st
20636 45056 5593 10833 0 0 94 6 0
16416 46080 4484 8666 0 0 94 6 0
16960 47104 5183 8936 0 0 94 6 0

and this

---io---- -system-- ------cpu-----
bi bo in cs us sy id wa st
384 73728 571 558 0 0 95 5 0
384 73728 548 545 0 0 95 5 0
388 73728 575 763 0 0 96 4 0

in the vmstat output. It's not quite as bad as on deeper queue depth
devices, where we get hugely bursty IO, but it's still very slow.

If we don't run the competing reader, the dirty data writeback proceeds
at normal rates:

# time (./write-files ; sync)
real 1m6.919s
user 0m0.010s
sys 0m10.900s


The above was run without scsi-mq, and using the deadline scheduler;
results with CFQ are similarly depressing for this test. So IO scheduling
is in place for this test; it's not pure blk-mq without scheduling.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases
that need to clean memory at or near device speeds get to do that.
We still don't need thousands of requests to accomplish that.
And for the cases where we don't need to be near device limits, we
can clean at a more reasonable pace. See the last patch in the series
for a more detailed description of the change, and the tunable.
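
To give a rough idea of how this looks from userspace (the device name
below is just an example), the tunable and the stats added in the last
patch live in the queue sysfs directory:

$ cat /sys/block/sda/queue/wb_percent
$ echo 10 > /sys/block/sda/queue/wb_percent
$ cat /sys/block/sda/queue/wb_stats

A wb_percent setting of 10 is what was used for the rotating storage
numbers above.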

I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server. The
patchset is fully stable; I have not observed any problems. It passes full
xfstest runs, and a variety of benchmarks as well. It works equally well
on blk-mq/scsi-mq, and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches. Patches are
against current Linus' git, 4.6.0-rc1; I can make them available
against 4.5 as well, if there's any interest in that for test
purposes.

Changes since v2

- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change
between wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1

- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in
the writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups


block/Makefile | 2
block/blk-core.c | 15 ++
block/blk-mq.c | 31 ++++-
block/blk-settings.c | 20 +++
block/blk-sysfs.c | 128 ++++++++++++++++++++
block/blk-wb.c | 238 +++++++++++++++++++++++++++++++++++++++
block/blk-wb.h | 33 +++++
drivers/nvme/host/core.c | 1
drivers/scsi/scsi.c | 3
drivers/scsi/sd.c | 5
fs/block_dev.c | 2
fs/buffer.c | 2
fs/f2fs/data.c | 2
fs/f2fs/node.c | 2
fs/fs-writeback.c | 14 ++
fs/gfs2/meta_io.c | 3
fs/mpage.c | 9 -
fs/xfs/xfs_aops.c | 2
include/linux/backing-dev-defs.h | 2
include/linux/blk_types.h | 2
include/linux/blkdev.h | 18 ++
include/linux/writeback.h | 8 +
mm/page-writeback.c | 2
23 files changed, 528 insertions(+), 16 deletions(-)


--
Jens Axboe


2016-03-30 15:08:13

by Jens Axboe

Subject: [PATCH 1/9] writeback: propagate the various reasons for writeback

Avoid losing context by propagating the various reasons why we
initiate writeback. If we are doing more important reclaim or
synchronous writeback, the lower levels should know about it.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fee81e8768c9..4300ee7b1139 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -52,6 +52,7 @@ struct wb_writeback_work {
unsigned int range_cyclic:1;
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
+ unsigned int for_reclaim:1; /* for mem reclaim */
unsigned int auto_free:1; /* free on completion */
enum wb_reason reason; /* why was writeback initiated? */

@@ -944,6 +945,18 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
work->reason = reason;
work->auto_free = 1;

+ switch (reason) {
+ case WB_REASON_TRY_TO_FREE_PAGES:
+ case WB_REASON_FREE_MORE_MEM:
+ work->for_reclaim = 1;
+ break;
+ case WB_REASON_SYNC:
+ work->for_sync = 1;
+ break;
+ default:
+ break;
+ }
+
wb_queue_work(wb, work);
}

@@ -1446,6 +1459,7 @@ static long writeback_sb_inodes(struct super_block *sb,
.for_kupdate = work->for_kupdate,
.for_background = work->for_background,
.for_sync = work->for_sync,
+ .for_reclaim = work->for_reclaim,
.range_cyclic = work->range_cyclic,
.range_start = 0,
.range_end = LLONG_MAX,
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:08:22

by Jens Axboe

Subject: [PATCH 9/9] writeback: throttle buffered writeback

Test patch that throttles buffered writeback to make it a lot
more smooth, and has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

This is just a test patch, and as such, it registers queue sysfs
entries to monitor the current state and tune the behavior:

$ cat /sys/block/nvme0n1/queue/wb_stats
idle=16, normal=32, max=64, inflight=0, wait=0, timer=0, bdp_wait=0

'idle' denotes how many requests we will allow inflight for idle
buffered writeback, 'normal' for higher priority writeback, and 'max'
for when it's urgent we clean pages. The values are calculated based
on the queue depth of the device, and the 'wb_percent' setting. If
'wb_percent' is set to zero, the functionality is turned off.

'inflight' shows how many requests are currently inflight for buffered
writeback, 'wait' shows if anyone is currently waiting for access,
'timer' shows if we are currently deferring submissions due to the
write back cache delay, and 'bdp_wait' shows if someone is currently
throttled on this device in balance_dirty_pages().

Finally, if the device has write back caching, 'wb_cache_usecs' sets
how many usecs to wait after a write completes before allowing more
background writeback to be issued.

It'd be nice to auto-tune 'wb_percent' based on device response. Flash
is less picky than rotating storage, but still needs throttling. For
flash storage, a wb_percent setting of 50% gives good read latencies
while still having good write bandwidth. For rotating storage, lower
settings (like 10-15%) are more reasonable.
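
As a worked example of how the limits fall out of calc_wb_limits()
below: with a queue depth of 64 (the cap used in the calculation) and
wb_percent=10, we get max = (10 * 64 + 1) / 100 = 6, normal =
(6 + 1) / 2 = 3, and idle = (6 + 3) / 4 = 2 requests in flight for
background writeback.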

Signed-off-by: Jens Axboe <[email protected]>
---
block/Makefile | 2 +-
block/blk-core.c | 15 +++
block/blk-mq.c | 31 +++++-
block/blk-settings.c | 3 +
block/blk-sysfs.c | 89 +++++++++++++++++
block/blk-wb.c | 238 ++++++++++++++++++++++++++++++++++++++++++++++
block/blk-wb.h | 33 +++++++
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 3 +
9 files changed, 413 insertions(+), 3 deletions(-)
create mode 100644 block/blk-wb.c
create mode 100644 block/blk-wb.h

diff --git a/block/Makefile b/block/Makefile
index 9eda2322b2d4..9df911a3b569 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
- blk-lib.o blk-mq.o blk-mq-tag.o \
+ blk-lib.o blk-mq.o blk-mq-tag.o blk-wb.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 827f8badd143..85a92cd6047b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@

#include "blk.h"
#include "blk-mq.h"
+#include "blk-wb.h"

EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -863,6 +864,9 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
*/
blk_queue_make_request(q, blk_queue_bio);

+ if (blk_wb_init(q))
+ goto fail;
+
q->sg_reserved_size = INT_MAX;

/* Protect q->elevator from elevator_change */
@@ -880,6 +884,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,

fail:
blk_free_flush_queue(q->fq);
+ blk_wb_exit(q);
return NULL;
}
EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1485,6 +1490,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
/* this is a bio leak */
WARN_ON(req->bio != NULL);

+ blk_wb_done(q->rq_wb, req);
+
/*
* Request may not have originated from ll_rw_blk. if not,
* it didn't come out of our reserved rq pools
@@ -1714,6 +1721,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
struct request *req;
unsigned int request_count = 0;
+ bool wb_acct;

/*
* low level driver can indicate that it wants pages above a
@@ -1766,6 +1774,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
}

get_rq:
+ wb_acct = blk_wb_wait(q->rq_wb, bio, q->queue_lock);
+
/*
* This sync check and mask will be re-done in init_request_from_bio(),
* but we need to set it earlier to expose the sync flag to the
@@ -1781,11 +1791,16 @@ get_rq:
*/
req = get_request(q, rw_flags, bio, GFP_NOIO);
if (IS_ERR(req)) {
+ if (wb_acct)
+ __blk_wb_done(q->rq_wb);
bio->bi_error = PTR_ERR(req);
bio_endio(bio);
goto out_unlock;
}

+ if (wb_acct)
+ req->cmd_flags |= REQ_BUF_INFLIGHT;
+
/*
* After dropping the lock and possibly sleeping here, our request
* may now be mergeable after it had proven unmergeable (above).
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1699baf39b78..437cdc9b429c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -29,6 +29,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-tag.h"
+#include "blk-wb.h"

static DEFINE_MUTEX(all_q_mutex);
static LIST_HEAD(all_q_list);
@@ -274,6 +275,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,

if (rq->cmd_flags & REQ_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);
+
+ blk_wb_done(q->rq_wb, rq);
+
rq->cmd_flags = 0;

clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -1253,6 +1257,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
blk_qc_t cookie;
+ bool wb_acct;

blk_queue_bounce(q, &bio);

@@ -1270,9 +1275,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
} else
request_count = blk_plug_queued_count(q);

+ wb_acct = blk_wb_wait(q->rq_wb, bio, NULL);
+
rq = blk_mq_map_request(q, bio, &data);
- if (unlikely(!rq))
+ if (unlikely(!rq)) {
+ if (wb_acct)
+ __blk_wb_done(q->rq_wb);
return BLK_QC_T_NONE;
+ }
+
+ if (wb_acct)
+ rq->cmd_flags |= REQ_BUF_INFLIGHT;

cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);

@@ -1349,6 +1362,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
struct blk_map_ctx data;
struct request *rq;
blk_qc_t cookie;
+ bool wb_acct;

blk_queue_bounce(q, &bio);

@@ -1363,9 +1377,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
blk_attempt_plug_merge(q, bio, &request_count, NULL))
return BLK_QC_T_NONE;

+ wb_acct = blk_wb_wait(q->rq_wb, bio, NULL);
+
rq = blk_mq_map_request(q, bio, &data);
- if (unlikely(!rq))
+ if (unlikely(!rq)) {
+ if (wb_acct)
+ __blk_wb_done(q->rq_wb);
return BLK_QC_T_NONE;
+ }
+
+ if (wb_acct)
+ rq->cmd_flags |= REQ_BUF_INFLIGHT;

cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);

@@ -2062,6 +2084,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
*/
q->nr_requests = set->queue_depth;

+ if (blk_wb_init(q))
+ goto err_hctxs;
+
if (set->ops->complete)
blk_queue_softirq_done(q, set->ops->complete);

@@ -2097,6 +2122,8 @@ void blk_mq_free_queue(struct request_queue *q)
list_del_init(&q->all_q_node);
mutex_unlock(&all_q_mutex);

+ blk_wb_exit(q);
+
blk_mq_del_queue_tag_set(q);

blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 06e01682f827..bd713a8aa755 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -13,6 +13,7 @@
#include <linux/gfp.h>

#include "blk.h"
+#include "blk-wb.h"

unsigned long blk_max_low_pfn;
EXPORT_SYMBOL(blk_max_low_pfn);
@@ -823,6 +824,8 @@ EXPORT_SYMBOL(blk_queue_update_dma_alignment);
void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
{
q->queue_depth = depth;
+ if (q->rq_wb)
+ blk_wb_update_limits(q->rq_wb, depth);
}
EXPORT_SYMBOL(blk_set_queue_depth);

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 954e510452d7..2afd5cb8f003 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -13,6 +13,7 @@

#include "blk.h"
#include "blk-mq.h"
+#include "blk-wb.h"

struct queue_sysfs_entry {
struct attribute attr;
@@ -347,6 +348,76 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
return ret;
}

+static ssize_t queue_wb_stats_show(struct request_queue *q, char *page)
+{
+ struct rq_wb *wb = q->rq_wb;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return sprintf(page, "idle=%d, normal=%d, max=%d, inflight=%d, wait=%d,"
+ " timer=%d, bdp_wait=%d\n", wb->wb_idle,
+ wb->wb_normal, wb->wb_max,
+ atomic_read(&wb->inflight),
+ waitqueue_active(&wb->wait),
+ timer_pending(&wb->timer),
+ *wb->bdp_wait);
+}
+
+static ssize_t queue_wb_perc_show(struct request_queue *q, char *page)
+{
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return queue_var_show(q->rq_wb->perc, page);
+}
+
+static ssize_t queue_wb_perc_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long perc;
+ ssize_t ret;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ ret = queue_var_store(&perc, page, count);
+ if (ret < 0)
+ return ret;
+ if (perc > 100)
+ return -EINVAL;
+
+ q->rq_wb->perc = perc;
+ blk_wb_update_limits(q->rq_wb, blk_queue_depth(q));
+ return ret;
+}
+
+static ssize_t queue_wb_cache_delay_show(struct request_queue *q, char *page)
+{
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return queue_var_show(q->rq_wb->cache_delay_usecs, page);
+}
+
+static ssize_t queue_wb_cache_delay_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long var;
+ ssize_t ret;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ ret = queue_var_store(&var, page, count);
+ if (ret < 0)
+ return ret;
+
+ q->rq_wb->cache_delay_usecs = var;
+ q->rq_wb->cache_delay = usecs_to_jiffies(var);
+ return ret;
+}
+
static ssize_t queue_wc_show(struct request_queue *q, char *page)
{
if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -516,6 +587,21 @@ static struct queue_sysfs_entry queue_wc_entry = {
.store = queue_wc_store,
};

+static struct queue_sysfs_entry queue_wb_stats_entry = {
+ .attr = {.name = "wb_stats", .mode = S_IRUGO },
+ .show = queue_wb_stats_show,
+};
+static struct queue_sysfs_entry queue_wb_cache_delay_entry = {
+ .attr = {.name = "wb_cache_usecs", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wb_cache_delay_show,
+ .store = queue_wb_cache_delay_store,
+};
+static struct queue_sysfs_entry queue_wb_perc_entry = {
+ .attr = {.name = "wb_percent", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wb_perc_show,
+ .store = queue_wb_perc_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -542,6 +628,9 @@ static struct attribute *default_attrs[] = {
&queue_random_entry.attr,
&queue_poll_entry.attr,
&queue_wc_entry.attr,
+ &queue_wb_stats_entry.attr,
+ &queue_wb_cache_delay_entry.attr,
+ &queue_wb_perc_entry.attr,
NULL,
};

diff --git a/block/blk-wb.c b/block/blk-wb.c
new file mode 100644
index 000000000000..d93dd1ccf16a
--- /dev/null
+++ b/block/blk-wb.c
@@ -0,0 +1,238 @@
+/*
+ * buffered writeback throttling
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ * Things that need changing:
+ *
+ * - Auto-detection of optimal wb_percent setting. A lower setting
+ * is appropriate on rotating storage (wb_percent=15 gives good
+ * separation, while still getting full bandwidth with wb cache).
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+
+#include "blk.h"
+#include "blk-wb.h"
+
+static inline bool rwb_enabled(struct rq_wb *rwb)
+{
+ return rwb->wb_normal != 0;
+}
+
+void __blk_wb_done(struct rq_wb *rwb)
+{
+ int inflight, limit = rwb->wb_normal;
+
+ inflight = atomic_dec_return(&rwb->inflight);
+ if (inflight >= limit)
+ return;
+
+ /*
+ * If the device does caching, we can still flood it with IO
+ * even at a low depth. If caching is on, delay a bit before
+ * submitting the next, if we're still purely background
+ * activity.
+ */
+ if (test_bit(QUEUE_FLAG_WC, &rwb->q->queue_flags) && !*rwb->bdp_wait &&
+ time_before(jiffies, rwb->last_comp + rwb->cache_delay)) {
+ if (!timer_pending(&rwb->timer))
+ mod_timer(&rwb->timer, jiffies + rwb->cache_delay);
+ return;
+ }
+
+ if (waitqueue_active(&rwb->wait)) {
+ int diff = limit - inflight;
+
+ if (diff >= rwb->wb_idle / 2)
+ wake_up_nr(&rwb->wait, 1);
+ }
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, when the request gets freed.
+ */
+void blk_wb_done(struct rq_wb *rwb, struct request *rq)
+{
+ if (!(rq->cmd_flags & REQ_BUF_INFLIGHT)) {
+ if (rwb_enabled(rwb)) {
+ const unsigned long cur = jiffies;
+
+ if (cur != rwb->last_comp)
+ rwb->last_comp = cur;
+ }
+ } else
+ __blk_wb_done(rwb);
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+ int cur = atomic_read(v);
+
+ for (;;) {
+ int old;
+
+ if (cur >= below)
+ return false;
+ old = atomic_cmpxchg(v, cur, cur + 1);
+ if (old == cur)
+ break;
+ cur = old;
+ }
+
+ return true;
+}
+
+static inline unsigned int get_limit(struct rq_wb *rwb, unsigned int rw)
+{
+ unsigned int limit;
+
+ /*
+ * At this point we know it's a buffered write. If REQ_SYNC is
+ * set, then it's WB_SYNC_ALL writeback. Bump the limit 4x for
+ * those, since someone is (or will be) waiting on that.
+ */
+ if ((rw & REQ_SYNC) || *rwb->bdp_wait)
+ limit = rwb->wb_max;
+ else if (time_before(jiffies, rwb->last_comp + HZ / 10)) {
+ /*
+ * If less than 100ms since we completed unrelated IO,
+ * limit us to half the depth for background writeback.
+ */
+ limit = rwb->wb_idle;
+ } else
+ limit = rwb->wb_normal;
+
+ return limit;
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __blk_wb_wait(struct rq_wb *rwb, unsigned int rw, spinlock_t *lock)
+{
+ DEFINE_WAIT(wait);
+
+ if (!timer_pending(&rwb->timer) &&
+ atomic_inc_below(&rwb->inflight, get_limit(rwb, rw)))
+ return;
+
+ do {
+ prepare_to_wait_exclusive(&rwb->wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ if (!timer_pending(&rwb->timer) &&
+ atomic_inc_below(&rwb->inflight, get_limit(rwb, rw)))
+ break;
+
+ if (lock)
+ spin_unlock_irq(lock);
+
+ io_schedule();
+
+ if (lock)
+ spin_lock_irq(lock);
+ } while (1);
+
+ finish_wait(&rwb->wait, &wait);
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+bool blk_wb_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
+{
+ /*
+ * If disabled, or not a WRITE (or a discard), do nothing
+ */
+ if (!rwb_enabled(rwb) || !(bio->bi_rw & REQ_WRITE) ||
+ (bio->bi_rw & REQ_DISCARD))
+ return false;
+
+ /*
+ * Don't throttle WRITE_ODIRECT
+ */
+ if ((bio->bi_rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC)
+ return false;
+
+ __blk_wb_wait(rwb, bio->bi_rw, lock);
+ return true;
+}
+
+static void calc_wb_limits(struct rq_wb *rwb, unsigned int depth,
+ unsigned int perc)
+{
+ /*
+ * We'll use depth==64 as a reasonable max limit that should be able
+ * to achieve full device bandwidth anywhere.
+ */
+ depth = min(64U, depth);
+
+ /*
+ * Full perf writes are max 'perc' percentage of the depth
+ */
+ rwb->wb_max = (perc * depth + 1) / 100;
+ if (!rwb->wb_max && perc)
+ rwb->wb_max = 1;
+ rwb->wb_normal = (rwb->wb_max + 1) / 2;
+ rwb->wb_idle = (rwb->wb_max + 3) / 4;
+}
+
+void blk_wb_update_limits(struct rq_wb *rwb, unsigned int depth)
+{
+ calc_wb_limits(rwb, depth, rwb->perc);
+ wake_up_all(&rwb->wait);
+}
+
+static void blk_wb_timer(unsigned long data)
+{
+ struct rq_wb *rwb = (struct rq_wb *) data;
+
+ if (waitqueue_active(&rwb->wait))
+ wake_up_nr(&rwb->wait, 1);
+}
+
+#define DEF_WB_PERC 50
+#define DEF_WB_CACHE_DELAY 10000
+
+int blk_wb_init(struct request_queue *q)
+{
+ struct rq_wb *rwb;
+
+ rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+ if (!rwb)
+ return -ENOMEM;
+
+ atomic_set(&rwb->inflight, 0);
+ init_waitqueue_head(&rwb->wait);
+ rwb->last_comp = jiffies;
+ rwb->bdp_wait = &q->backing_dev_info.wb.dirty_sleeping;
+ setup_timer(&rwb->timer, blk_wb_timer, (unsigned long) rwb);
+ rwb->perc = DEF_WB_PERC;
+ rwb->cache_delay_usecs = DEF_WB_CACHE_DELAY;
+ rwb->cache_delay = usecs_to_jiffies(rwb->cache_delay_usecs);
+ rwb->q = q;
+ blk_wb_update_limits(rwb, blk_queue_depth(q));
+ q->rq_wb = rwb;
+ return 0;
+}
+
+void blk_wb_exit(struct request_queue *q)
+{
+ if (q->rq_wb)
+ del_timer_sync(&q->rq_wb->timer);
+
+ kfree(q->rq_wb);
+ q->rq_wb = NULL;
+}
diff --git a/block/blk-wb.h b/block/blk-wb.h
new file mode 100644
index 000000000000..201bc00ac7a7
--- /dev/null
+++ b/block/blk-wb.h
@@ -0,0 +1,33 @@
+#ifndef BLK_WB_H
+#define BLK_WB_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+
+struct rq_wb {
+ /*
+ * Settings that govern how we throttle
+ */
+ unsigned int perc; /* INPUT */
+ unsigned int wb_idle; /* idle writeback */
+ unsigned int wb_normal; /* normal writeback */
+ unsigned int wb_max; /* max throughput writeback */
+
+ unsigned int cache_delay;
+ unsigned int cache_delay_usecs;
+ unsigned long last_comp;
+ unsigned int *bdp_wait;
+ struct request_queue *q;
+ atomic_t inflight;
+ wait_queue_head_t wait;
+ struct timer_list timer;
+};
+
+void __blk_wb_done(struct rq_wb *);
+void blk_wb_done(struct rq_wb *, struct request *);
+bool blk_wb_wait(struct rq_wb *, struct bio *, spinlock_t *);
+int blk_wb_init(struct request_queue *);
+void blk_wb_exit(struct request_queue *);
+void blk_wb_update_limits(struct rq_wb *, unsigned int);
+
+#endif
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86a38ea1823f..6f2a174b771c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -188,6 +188,7 @@ enum rq_flag_bits {
__REQ_PM, /* runtime pm request */
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
+ __REQ_BUF_INFLIGHT, /* track inflight for buffered */
__REQ_NR_BITS, /* stops here */
};

@@ -241,6 +242,7 @@ enum rq_flag_bits {
#define REQ_PM (1ULL << __REQ_PM)
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)
+#define REQ_BUF_INFLIGHT (1ULL << __REQ_BUF_INFLIGHT)

typedef unsigned int blk_qc_t;
#define BLK_QC_T_NONE -1U
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 08b897b159d1..ee9b90ff4fde 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -37,6 +37,7 @@ struct bsg_job;
struct blkcg_gq;
struct blk_flush_queue;
struct pr_ops;
+struct rq_wb;

#define BLKDEV_MIN_RQ 4
#define BLKDEV_MAX_RQ 128 /* Default maximum */
@@ -290,6 +291,8 @@ struct request_queue {
int nr_rqs[2]; /* # allocated [a]sync rqs */
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */

+ struct rq_wb *rq_wb;
+
/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
* is used, root blkg allocates from @q->root_rl and all other
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:08:20

by Jens Axboe

Subject: [PATCH 7/9] NVMe: inform block layer of write cache state

This isn't quite correct, since the VWC merely states if a potential
write back cache is volatile or not. But for the purpose of write
absorption, it's good enough.

Signed-off-by: Jens Axboe <[email protected]>
---
drivers/nvme/host/core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 643f457131c2..05c8edfb7611 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -906,6 +906,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
blk_queue_virt_boundary(q, ctrl->page_size - 1);
+ blk_queue_write_cache(q, ctrl->vwc & NVME_CTRL_VWC_PRESENT);
}

/*
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:08:19

by Jens Axboe

Subject: [PATCH 8/9] block: add code to track actual device queue depth

For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow requests to back up for merging.

Fill a hole in struct request_queue with a 'queue_depth' member, and
add a helper that drivers can call to more closely inform the block
layer of the real queue depth.
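
As a minimal sketch of the intended usage (the function here is just
illustrative, mirroring the SCSI hunk below), a driver that learns its
real hardware queue depth would hand it to the block layer like this:

static void example_set_hw_depth(struct scsi_device *sdev, int depth)
{
	/* tell the block layer how deep the hw queue really is */
	if (sdev->request_queue)
		blk_set_queue_depth(sdev->request_queue, depth);
}

Readers on the block layer side then use blk_queue_depth(), which falls
back to q->nr_requests when no driver-provided depth has been set.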

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-settings.c | 6 ++++++
drivers/scsi/scsi.c | 3 +++
include/linux/blkdev.h | 11 +++++++++++
3 files changed, 20 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4dbd511a9889..06e01682f827 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -820,6 +820,12 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
}
EXPORT_SYMBOL(blk_queue_update_dma_alignment);

+void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
+{
+ q->queue_depth = depth;
+}
+EXPORT_SYMBOL(blk_set_queue_depth);
+
/**
* blk_queue_flush - configure queue's cache flush capability
* @q: the request queue for the device
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index b1bf42b93fcc..6503724865e7 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -621,6 +621,9 @@ int scsi_change_queue_depth(struct scsi_device *sdev, int depth)
wmb();
}

+ if (sdev->request_queue)
+ blk_set_queue_depth(sdev->request_queue, depth);
+
return sdev->queue_depth;
}
EXPORT_SYMBOL(scsi_change_queue_depth);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 76e875159e52..08b897b159d1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -315,6 +315,8 @@ struct request_queue {
struct blk_mq_ctx __percpu *queue_ctx;
unsigned int nr_queues;

+ unsigned int queue_depth;
+
/* hw dispatch queues */
struct blk_mq_hw_ctx **queue_hw_ctx;
unsigned int nr_hw_queues;
@@ -683,6 +685,14 @@ static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
return false;
}

+static inline unsigned int blk_queue_depth(struct request_queue *q)
+{
+ if (q->queue_depth)
+ return q->queue_depth;
+
+ return q->nr_requests;
+}
+
/*
* q->prep_rq_fn return values
*/
@@ -986,6 +996,7 @@ extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
+extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
extern void blk_set_default_limits(struct queue_limits *lim);
extern void blk_set_stacking_limits(struct queue_limits *lim);
extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:08:17

by Jens Axboe

Subject: [PATCH 6/9] sd: inform block layer of write cache state

Signed-off-by: Jens Axboe <[email protected]>
---
drivers/scsi/sd.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5a5457ac9cdb..049f424fb4ad 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -192,6 +192,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
sdkp->WCE = wce;
sdkp->RCD = rcd;
sd_set_flush_flag(sdkp);
+ blk_queue_write_cache(sdp->request_queue, wce != 0);
return count;
}

@@ -2571,7 +2572,7 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
sdkp->DPOFUA ? "supports DPO and FUA"
: "doesn't support DPO or FUA");

- return;
+ goto done;
}

bad_sense:
@@ -2596,6 +2597,8 @@ defaults:
}
sdkp->RCD = 0;
sdkp->DPOFUA = 0;
+done:
+ blk_queue_write_cache(sdp->request_queue, sdkp->WCE != 0);
}

/*
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:10:45

by Jens Axboe

Subject: [PATCH 5/9] block: add ability to flag write back caching on a device

Add an internal helper and flag for setting whether a queue has
write back caching, or write through (or none). Add a sysfs file
to show this as well, and make it changeable from user space.
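
A rough sketch of the resulting interface (device name is just an
example; this is what it would look like on a device with a write back
cache):

$ cat /sys/block/sda/queue/write_cache
write back
$ echo "write through" > /sys/block/sda/queue/write_cache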

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-settings.c | 11 +++++++++++
block/blk-sysfs.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 4 ++++
3 files changed, 54 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index c7bb666aafd1..4dbd511a9889 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -846,6 +846,17 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
}
EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);

+void blk_queue_write_cache(struct request_queue *q, bool enabled)
+{
+ spin_lock_irq(q->queue_lock);
+ if (enabled)
+ queue_flag_set(QUEUE_FLAG_WC, q);
+ else
+ queue_flag_clear(QUEUE_FLAG_WC, q);
+ spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL_GPL(blk_queue_write_cache);
+
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index dd93763057ce..954e510452d7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -347,6 +347,38 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
return ret;
}

+static ssize_t queue_wc_show(struct request_queue *q, char *page)
+{
+ if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+ return sprintf(page, "write back\n");
+
+ return sprintf(page, "write through\n");
+}
+
+static ssize_t queue_wc_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ int set = -1;
+
+ if (!strncmp(page, "write back", 10))
+ set = 1;
+ else if (!strncmp(page, "write through", 13) ||
+ !strncmp(page, "none", 4))
+ set = 0;
+
+ if (set == -1)
+ return -EINVAL;
+
+ spin_lock_irq(q->queue_lock);
+ if (set)
+ queue_flag_set(QUEUE_FLAG_WC, q);
+ else
+ queue_flag_clear(QUEUE_FLAG_WC, q);
+ spin_unlock_irq(q->queue_lock);
+
+ return count;
+}
+
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -478,6 +510,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
.store = queue_poll_store,
};

+static struct queue_sysfs_entry queue_wc_entry = {
+ .attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wc_show,
+ .store = queue_wc_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -503,6 +541,7 @@ static struct attribute *default_attrs[] = {
&queue_iostats_entry.attr,
&queue_random_entry.attr,
&queue_poll_entry.attr,
+ &queue_wc_entry.attr,
NULL,
};

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7e5d7e018bea..76e875159e52 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -491,15 +491,18 @@ struct request_queue {
#define QUEUE_FLAG_INIT_DONE 20 /* queue is initialized */
#define QUEUE_FLAG_NO_SG_MERGE 21 /* don't attempt to merge SG segments*/
#define QUEUE_FLAG_POLL 22 /* IO polling enabled if set */
+#define QUEUE_FLAG_WC 23 /* Write back caching */

#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
+ (1 << QUEUE_FLAG_WC) | \
(1 << QUEUE_FLAG_ADD_RANDOM))

#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
+ (1 << QUEUE_FLAG_WC) | \
(1 << QUEUE_FLAG_POLL))

static inline void queue_lockdep_assert_held(struct request_queue *q)
@@ -1009,6 +1012,7 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
+extern void blk_queue_write_cache(struct request_queue *q, bool enabled);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);

extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:11:01

by Jens Axboe

Subject: [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages()

Note in the bdi_writeback structure if a task is currently being
limited in balance_dirty_pages(), waiting for writeback to
proceed.

Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/backing-dev-defs.h | 2 ++
mm/page-writeback.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1b4d69f68c33..f702309216b4 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
struct list_head work_list;
struct delayed_work dwork; /* work item used for writeback */

+ int dirty_sleeping; /* waiting on dirty limit exceeded */
+
struct list_head bdi_node; /* anchored at bdi->wb_list */

#ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..15e696bc5d14 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1746,7 +1746,9 @@ pause:
pause,
start_time);
__set_current_state(TASK_KILLABLE);
+ wb->dirty_sleeping = 1;
io_schedule_timeout(pause);
+ wb->dirty_sleeping = 0;

current->dirty_paused_when = now + pause;
current->nr_dirtied = 0;
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:11:39

by Jens Axboe

Subject: [PATCH 2/9] writeback: add wbc_to_write()

Add wbc_to_write(), which returns the write type to use, based on a
struct writeback_control. No functional changes in this patch, but it
prepares us for factoring other wbc fields into the write type decision.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/block_dev.c | 2 +-
fs/buffer.c | 2 +-
fs/f2fs/data.c | 2 +-
fs/f2fs/node.c | 2 +-
fs/gfs2/meta_io.c | 3 +--
fs/mpage.c | 9 ++++-----
fs/xfs/xfs_aops.c | 2 +-
include/linux/writeback.h | 8 ++++++++
8 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3172c4e2f502..b11d4e08b9a7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -432,7 +432,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
struct page *page, struct writeback_control *wbc)
{
int result;
- int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+ int rw = wbc_to_write(wbc);
const struct block_device_operations *ops = bdev->bd_disk->fops;

if (!ops->rw_page || bdev_get_integrity(bdev))
diff --git a/fs/buffer.c b/fs/buffer.c
index 33be29675358..28273caaf2b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1697,7 +1697,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr_underway = 0;
- int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ int write_op = wbc_to_write(wbc);

head = create_page_buffers(page, inode,
(1 << BH_Dirty)|(1 << BH_Uptodate));
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index e5c762b37239..dca5d43c67a3 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1143,7 +1143,7 @@ static int f2fs_write_data_page(struct page *page,
struct f2fs_io_info fio = {
.sbi = sbi,
.type = DATA,
- .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+ .rw = wbc_to_write(wbc),
.page = page,
.encrypted_page = NULL,
};
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 118321bd1a7f..db9201f45bf1 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1397,7 +1397,7 @@ static int f2fs_write_node_page(struct page *page,
struct f2fs_io_info fio = {
.sbi = sbi,
.type = NODE,
- .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+ .rw = wbc_to_write(wbc),
.page = page,
.encrypted_page = NULL,
};
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index e137d96f1b17..ede87306caa5 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
{
struct buffer_head *bh, *head;
int nr_underway = 0;
- int write_op = REQ_META | REQ_PRIO |
- (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ int write_op = REQ_META | REQ_PRIO | wbc_to_write(wbc);

BUG_ON(!PageLocked(page));
BUG_ON(!page_has_buffers(page));
diff --git a/fs/mpage.c b/fs/mpage.c
index 6bd9fd90964e..9986c752f7bb 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -486,7 +486,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
struct buffer_head map_bh;
loff_t i_size = i_size_read(inode);
int ret = 0;
- int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);

if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
@@ -595,7 +594,7 @@ page_is_mapped:
* This page will go to BIO. Do we need to send this BIO off first?
*/
if (bio && mpd->last_block_in_bio != blocks[0] - 1)
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write(wbc), bio);

alloc_new:
if (bio == NULL) {
@@ -622,7 +621,7 @@ alloc_new:
wbc_account_io(wbc, page, PAGE_SIZE);
length = first_unmapped << blkbits;
if (bio_add_page(bio, page, length, 0) < length) {
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write(wbc), bio);
goto alloc_new;
}

@@ -632,7 +631,7 @@ alloc_new:
set_page_writeback(page);
unlock_page(page);
if (boundary || (first_unmapped != blocks_per_page)) {
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write(wbc), bio);
if (boundary_block) {
write_boundary_block(boundary_bdev,
boundary_block, 1 << blkbits);
@@ -644,7 +643,7 @@ alloc_new:

confused:
if (bio)
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write(wbc), bio);

if (mpd->use_writepage) {
ret = mapping->a_ops->writepage(page, wbc);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d445a64b979e..239a612ea1d6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -393,7 +393,7 @@ xfs_submit_ioend_bio(
atomic_inc(&ioend->io_remaining);
bio->bi_private = ioend;
bio->bi_end_io = xfs_end_bio;
- submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
+ submit_bio(wbc_to_write(wbc), bio);
}

STATIC struct bio *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5d4e08..719c255e105a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -100,6 +100,14 @@ struct writeback_control {
#endif
};

+static inline int wbc_to_write(struct writeback_control *wbc)
+{
+ if (wbc->sync_mode == WB_SYNC_ALL)
+ return WRITE_SYNC;
+
+ return WRITE;
+}
+
/*
* A wb_domain represents a domain that wb's (bdi_writeback's) belong to
* and are measured against each other in. There always is one global
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:11:38

by Jens Axboe

Subject: [PATCH 3/9] writeback: use WRITE_SYNC for reclaim or sync writeback

If we're doing reclaim or sync IO, use WRITE_SYNC to inform the lower
levels of the importance of this IO.

Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/writeback.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 719c255e105a..b2c75b8901da 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -102,7 +102,7 @@ struct writeback_control {

static inline int wbc_to_write(struct writeback_control *wbc)
{
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->for_reclaim || wbc->for_sync)
return WRITE_SYNC;

return WRITE;
--
2.8.0.rc4.6.g7e4ba36

2016-03-30 15:42:43

by Christoph Hellwig

Subject: Re: [PATCH 5/9] block: add ability to flag write back caching on a device

On Wed, Mar 30, 2016 at 09:07:53AM -0600, Jens Axboe wrote:
> Add an internal helper and flag for setting whether a queue has
> write back caching, or write through (or none). Add a sysfs file
> to show this as well, and make it changeable from user space.

As per previous discussion: I don't really care about the way we
tell the block layer we have a writeback cache, but a NAK to having
each driver call two interfaces to convey the same information.

Please just look at q->flush_flags & REQ_FLUSH for now, and then
improve the interface if you don't like it (I don't particularly like
it, but not to the point that I'm motivated enough to fix it :))

2016-03-30 15:48:08

by Jens Axboe

Subject: Re: [PATCH 5/9] block: add ability to flag write back caching on a device

On 03/30/2016 09:42 AM, Christoph Hellwig wrote:
> On Wed, Mar 30, 2016 at 09:07:53AM -0600, Jens Axboe wrote:
>> Add an internal helper and flag for setting whether a queue has
>> write back caching, or write through (or none). Add a sysfs file
>> to show this as well, and make it changeable from user space.
>
> As per previous discussion: I don't really care about the way how we
> tell the block layer we have a writeback cache, but a NAK to having
> each driver call two interfaces to convey the same information.
>
> Please just look at q->flush_flag & REQ_FLUSH for now, and then
> improvem the interface if you don't like it (I don't particularly like
> it, but not to the point that I'm motivated enough to fix it :))

That's fine, I don't mind making that change, I just didn't do it for this
version. I'd prefer that we change the cache flagging to be the primary way
to signal it; it's more intuitive than REQ_FLUSH.

It'll be in the next version.

--
Jens Axboe

2016-03-30 16:25:46

by Jens Axboe

Subject: Re: [PATCH 5/9] block: add ability to flag write back caching on a device

On 03/30/2016 09:46 AM, Jens Axboe wrote:
> On 03/30/2016 09:42 AM, Christoph Hellwig wrote:
>> On Wed, Mar 30, 2016 at 09:07:53AM -0600, Jens Axboe wrote:
>>> Add an internal helper and flag for setting whether a queue has
>>> write back caching, or write through (or none). Add a sysfs file
>>> to show this as well, and make it changeable from user space.
>>
>> As per previous discussion: I don't really care about the way how we
>> tell the block layer we have a writeback cache, but a NAK to having
>> each driver call two interfaces to convey the same information.
>>
>> Please just look at q->flush_flag & REQ_FLUSH for now, and then
>> improvem the interface if you don't like it (I don't particularly like
>> it, but not to the point that I'm motivated enough to fix it :))
>
> That's fine, I don't mind making that change, just didn't do it for this
> version. I prefer if we change the cache flagging to be the primary way
> to signal it, it's more intuitive than REQ_FLUSH.
>
> It'll be in the next version.

Conversion series in here now:

http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

--
Jens Axboe

2016-03-30 17:29:46

by Christoph Hellwig

Subject: Re: [PATCH 5/9] block: add ability to flag write back caching on a device

On Wed, Mar 30, 2016 at 10:23:50AM -0600, Jens Axboe wrote:
> Conversion series in here now:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

The new API looks reasonable to me.

2016-03-31 08:24:43

by Dave Chinner

Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> Hi,
>
> This patchset isn't as much a final solution, as it's demonstration
> of what I believe is a huge issue. Since the dawn of time, our
> background buffered writeback has sucked. When we do background
> buffered writeback, it should have little impact on foreground
> activity. That's the definition of background activity... But for as
> long as I can remember, heavy buffered writers has not behaved like
> that. For instance, if I do something like this:
>
> $ dd if=/dev/zero of=foo bs=1M count=10k
>
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts data base reads or sync writes. When that happens, I get people
> yelling at me.
>
> Last time I posted this, I used flash storage as the example. But
> this works equally well on rotating storage. Let's run a test case
> that writes a lot. This test writes 50 files, each 100M, on XFS on
> a regular hard drive. While this happens, we attempt to read
> another file with fio.
>
> Writers:
>
> $ time (./write-files ; sync)
> real 1m6.304s
> user 0m0.020s
> sys 0m12.210s

Great. So a basic IO test looks good - let's throw something more
complex at it. Say, a benchmark I've been using for years to stress
the IO subsystem, the filesystem and memory reclaim all at the same
time: a concurrent fsmark inode creation test.
(first google hit https://lkml.org/lkml/2013/9/10/46)

This generates thousands of REQ_WRITE metadata IOs every second, so
if I understand how the throttle works correctly, these would be
classified as background writeback by the block layer throttle.
And....

FSUse% Count Size Files/sec App Overhead
0 1600000 0 255845.0 10796891
0 3200000 0 261348.8 10842349
0 4800000 0 249172.3 14121232
0 6400000 0 245172.8 12453759
0 8000000 0 201249.5 14293100
0 9600000 0 200417.5 29496551
>>>> 0 11200000 0 90399.6 40665397
0 12800000 0 212265.6 21839031
0 14400000 0 206398.8 32598378
0 16000000 0 197589.7 26266552
0 17600000 0 206405.2 16447795
>>>> 0 19200000 0 99189.6 87650540
0 20800000 0 249720.8 12294862
0 22400000 0 138523.8 47330007
>>>> 0 24000000 0 85486.2 14271096
0 25600000 0 157538.1 64430611
0 27200000 0 109677.8 47835961
0 28800000 0 207230.5 31301031
0 30400000 0 188739.6 33750424
0 32000000 0 174197.9 41402526
0 33600000 0 139152.0 100838085
0 35200000 0 203729.7 34833764
0 36800000 0 228277.4 12459062
>>>> 0 38400000 0 94962.0 30189182
0 40000000 0 166221.9 40564922
>>>> 0 41600000 0 62902.5 80098461
0 43200000 0 217932.6 22539354
0 44800000 0 189594.6 24692209
0 46400000 0 137834.1 39822038
0 48000000 0 240043.8 12779453
0 49600000 0 176830.8 16604133
0 51200000 0 180771.8 32860221

real 5m35.967s
user 3m57.054s
sys 48m53.332s

In those highlighted report points, the performance has dropped
significantly. The typical range I expect to see once memory has
filled (a bit over 8m inodes) is 180k-220k. Runtime on a vanilla
kernel was 4m40s and there were no performance drops, so this
workload runs almost a minute slower with the block layer throttling
code.

What I see in these performance dips is the XFS transaction
subsystem stalling *completely* - instead of running at a steady
state of around 350,000 transactions/s, there are *zero*
transactions running for periods of up to ten seconds. This
coincides with the CPU usage falling to almost zero as well.
AFAICT, the only thing that is running when the filesystem stalls
like this is memory reclaim.

Without the block throttling patches, the workload quickly finds a
steady state of around 7.5-8.5 million cached inodes, and it doesn't
vary much outside those bounds. With the block throttling patches,
on every transaction subsystem stall that occurs, the inode cache
gets 3-4 million inodes trimmed out of it (i.e. half the
cache), and in a couple of cases I saw it trim 6+ million inodes from
the cache before the transactions started up and the cache started
growing again.

> The above was run without scsi-mq, and with using the deadline scheduler,
> results with CFQ are similary depressing for this test. So IO scheduling
> is in place for this test, it's not pure blk-mq without scheduling.

virtio in guest, XFS direct IO -> no-op -> scsi in host.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-31 14:29:44

by Jens Axboe

Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 03/31/2016 02:24 AM, Dave Chinner wrote:
> On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
>> Hi,
>>
>> This patchset isn't as much a final solution, as it's demonstration
>> of what I believe is a huge issue. Since the dawn of time, our
>> background buffered writeback has sucked. When we do background
>> buffered writeback, it should have little impact on foreground
>> activity. That's the definition of background activity... But for as
>> long as I can remember, heavy buffered writers has not behaved like
>> that. For instance, if I do something like this:
>>
>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>
>> on my laptop, and then try and start chrome, it basically won't start
>> before the buffered writeback is done. Or, for server oriented
>> workloads, where installation of a big RPM (or similar) adversely
>> impacts data base reads or sync writes. When that happens, I get people
>> yelling at me.
>>
>> Last time I posted this, I used flash storage as the example. But
>> this works equally well on rotating storage. Let's run a test case
>> that writes a lot. This test writes 50 files, each 100M, on XFS on
>> a regular hard drive. While this happens, we attempt to read
>> another file with fio.
>>
>> Writers:
>>
>> $ time (./write-files ; sync)
>> real 1m6.304s
>> user 0m0.020s
>> sys 0m12.210s
>
> Great. So a basic IO tests looks good - let's through something more
> complex at it. Say, a benchmark I've been using for years to stress
> the Io subsystem, the filesystem and memory reclaim all at the same
> time: a concurent fsmark inode creation test.
> (first google hit https://lkml.org/lkml/2013/9/10/46)

Is that how you are invoking it as well, with the same arguments?

> This generates thousands of REQ_WRITE metadata IOs every second, so
> iif I understand how the throttle works correctly, these would be
> classified as background writeback by the block layer throttle.
> And....
>
> FSUse% Count Size Files/sec App Overhead
> 0 1600000 0 255845.0 10796891
> 0 3200000 0 261348.8 10842349
> 0 4800000 0 249172.3 14121232
> 0 6400000 0 245172.8 12453759
> 0 8000000 0 201249.5 14293100
> 0 9600000 0 200417.5 29496551
>>>>> 0 11200000 0 90399.6 40665397
> 0 12800000 0 212265.6 21839031
> 0 14400000 0 206398.8 32598378
> 0 16000000 0 197589.7 26266552
> 0 17600000 0 206405.2 16447795
>>>>> 0 19200000 0 99189.6 87650540
> 0 20800000 0 249720.8 12294862
> 0 22400000 0 138523.8 47330007
>>>>> 0 24000000 0 85486.2 14271096
> 0 25600000 0 157538.1 64430611
> 0 27200000 0 109677.8 47835961
> 0 28800000 0 207230.5 31301031
> 0 30400000 0 188739.6 33750424
> 0 32000000 0 174197.9 41402526
> 0 33600000 0 139152.0 100838085
> 0 35200000 0 203729.7 34833764
> 0 36800000 0 228277.4 12459062
>>>>> 0 38400000 0 94962.0 30189182
> 0 40000000 0 166221.9 40564922
>>>>> 0 41600000 0 62902.5 80098461
> 0 43200000 0 217932.6 22539354
> 0 44800000 0 189594.6 24692209
> 0 46400000 0 137834.1 39822038
> 0 48000000 0 240043.8 12779453
> 0 49600000 0 176830.8 16604133
> 0 51200000 0 180771.8 32860221
>
> real 5m35.967s
> user 3m57.054s
> sys 48m53.332s
>
> In those highlighted report points, the performance has dropped
> significantly. The typical range I expect to see ionce memory has
> filled (a bit over 8m inodes) is 180k-220k. Runtime on a vanilla
> kernel was 4m40s and there were no performance drops, so this
> workload runs almost a minute slower with the block layer throttling
> code.
>
> What I see in these performance dips is the XFS transaction
> subsystem stalling *completely* - instead of running at a steady
> state of around 350,000 transactions/s, there are *zero*
> transactions running for periods of up to ten seconds. This
> co-incides with the CPU usage falling to almost zero as well.
> AFAICT, the only thing that is running when the filesystem stalls
> like this is memory reclaim.

I'll take a look at this; stalls should definitely not be occurring. How
much memory does the box have?

> Without the block throttling patches, the workload quickly finds a
> steady state of around 7.5-8.5 million cached inodes, and it doesn't
> vary much outside those bounds. With the block throttling patches,
> on every transaction subsystem stall that occurs, the inode cache
> gets 3-4 million inodes trimmed out of it (i.e. half the
> cache), and in a couple of cases I saw it trim 6+ million inodes from
> the cache before the transactions started up and the cache started
> growing again.
>
>> The above was run without scsi-mq, and with using the deadline scheduler,
>> results with CFQ are similary depressing for this test. So IO scheduling
>> is in place for this test, it's not pure blk-mq without scheduling.
>
> virtio in guest, XFS direct IO -> no-op -> scsi in host.

That has write back caching enabled on the guest, correct?

--
Jens Axboe

2016-03-31 16:21:10

by Jens Axboe

Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 03/31/2016 08:29 AM, Jens Axboe wrote:
>> What I see in these performance dips is the XFS transaction
>> subsystem stalling *completely* - instead of running at a steady
>> state of around 350,000 transactions/s, there are *zero*
>> transactions running for periods of up to ten seconds. This
>> co-incides with the CPU usage falling to almost zero as well.
>> AFAICT, the only thing that is running when the filesystem stalls
>> like this is memory reclaim.
>
> I'll take a look at this, stalls should definitely not be occurring. How
> much memory does the box have?

I can't seem to reproduce this at all. On an nvme device, I get a fairly
steady 60K/sec file creation rate, and we're nowhere near being IO
bound. So the throttling has no effect at all.

On a raid0 on 4 flash devices, I get something that looks more IO bound,
for some reason. Still no impact of the throttling, however. But given
that your setup is this:

virtio in guest, XFS direct IO -> no-op -> scsi in host.

we do potentially have two throttling points, which we don't want. Are
both the guest and the host running the new code, or just the guest?

In any case, can I talk you into trying with two patches on top of the
current code? It's the two newest patches here:

http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle

The first treats REQ_META|REQ_PRIO like they should be treated, like
high priority IO. The second disables throttling for virtual devices, so
we only throttle on the backend. The latter should probably be the other
way around, but we need some way of conveying that information to the
backend.
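
In rough terms, the first change boils down to something like the sketch
below (a hand-waved illustration, not the actual patch - the helper name
and exact hook point are made up):

#include <linux/bio.h>
#include <linux/blk_types.h>

/* Hypothetical classification helper for the writeback throttle. */
static bool wb_bio_throttleable(struct bio *bio)
{
        if (!(bio->bi_rw & REQ_WRITE))
                return false;           /* reads are never throttled */
        if (bio->bi_rw & (REQ_META | REQ_PRIO))
                return false;           /* metadata/priority IO is high prio */
        if (bio->bi_rw & REQ_SYNC)
                return false;           /* sync writes are foreground work */
        return true;                    /* plain async buffered writeback */
}

Only bios that classify as plain background writeback remain candidates
for the queue depth limiting.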

--
Jens Axboe

2016-03-31 22:10:10

by Holger Hoffstätte

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck


Hi,

Jens mentioned on Twitter I should post my experience here as well,
so here we go.

I've backported this series (incl. updates) to stable-4.4.x - not too
difficult, minus the NVM part which I don't need anyway - and have been
running it for the past few days without any problem whatsoever, with
GREAT success.

My use case is primarily larger amounts of stuff (transcoded movies,
finished downloads, built Gentoo packages) that gets copied from tmpfs
to SSD (or disk) and every time that happens, the system noticeably
strangles readers (desktop, interactive shell). It does not really matter
how I tune writeback via the write_expire/dirty_bytes knobs or the
scheduler (and yes, I understand how they work); lowering the writeback
limits helped a bit but the system is still overwhelmed. Jacking up
deadline's writes_starved to unreasonable levels helps a bit, but in turn
makes all writes suffer. Anything else - even tried BFQ for a while,
which has its own unrelated problems - didn't really help either.

With this patchset the buffered writeback in these situations is much
improved, and copying several GBs at once to a SATA-3 SSD (or even an
external USB-2 disk with measly 40 MB/s) doodles along in the background
like it always should have, and desktop work is not noticeably affected.

I guess the effect will be even more noticeable on slower block devices
(laptops, old SSDs or disks).

So: +1 would apply again!

cheers
Holger

2016-04-01 00:47:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On Thu, Mar 31, 2016 at 08:29:35AM -0600, Jens Axboe wrote:
> On 03/31/2016 02:24 AM, Dave Chinner wrote:
> >On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> >>Hi,
> >>
> >>This patchset isn't as much a final solution, as it's demonstration
> >>of what I believe is a huge issue. Since the dawn of time, our
> >>background buffered writeback has sucked. When we do background
> >>buffered writeback, it should have little impact on foreground
> >>activity. That's the definition of background activity... But for as
> >>long as I can remember, heavy buffered writers has not behaved like
> >>that. For instance, if I do something like this:
> >>
> >>$ dd if=/dev/zero of=foo bs=1M count=10k
> >>
> >>on my laptop, and then try and start chrome, it basically won't start
> >>before the buffered writeback is done. Or, for server oriented
> >>workloads, where installation of a big RPM (or similar) adversely
> >>impacts data base reads or sync writes. When that happens, I get people
> >>yelling at me.
> >>
> >>Last time I posted this, I used flash storage as the example. But
> >>this works equally well on rotating storage. Let's run a test case
> >>that writes a lot. This test writes 50 files, each 100M, on XFS on
> >>a regular hard drive. While this happens, we attempt to read
> >>another file with fio.
> >>
> >>Writers:
> >>
> >>$ time (./write-files ; sync)
> >>real 1m6.304s
> >>user 0m0.020s
> >>sys 0m12.210s
> >
> >Great. So a basic IO test looks good - let's throw something more
> >complex at it. Say, a benchmark I've been using for years to stress
> >the IO subsystem, the filesystem and memory reclaim all at the same
> >time: a concurrent fsmark inode creation test.
> >(first google hit https://lkml.org/lkml/2013/9/10/46)
>
> Is that how you are invoking it as well, with the same arguments?

Yes. And the VM is exactly the same, too - 16p/16GB RAM. Cut down
version of the script I use:

#!/bin/bash

QUOTA=
MKFSOPTS=
NFILES=100000
DEV=/dev/vdc
LOGBSIZE=256k
FSMARK=/home/dave/src/fs_mark-3.3/fs_mark
MNT=/mnt/scratch

while [ $# -gt 0 ]; do
        case "$1" in
        -q)     QUOTA="uquota,gquota,pquota" ;;
        -N)     NFILES=$2 ; shift ;;
        -d)     DEV=$2 ; shift ;;
        -l)     LOGBSIZE=$2; shift ;;
        --)     shift ; break ;;
        esac
        shift
done
MKFSOPTS="$MKFSOPTS $*"

echo QUOTA=$QUOTA
echo MKFSOPTS=$MKFSOPTS
echo DEV=$DEV

sudo umount $MNT > /dev/null 2>&1
sudo mkfs.xfs -f $MKFSOPTS $DEV
sudo mount -o nobarrier,logbsize=$LOGBSIZE,$QUOTA $DEV $MNT
sudo chmod 777 $MNT
sudo sh -c "echo 1 > /proc/sys/fs/xfs/stats_clear"
time $FSMARK -D 10000 -S0 -n $NFILES -s 0 -L 32 \
-d $MNT/0 -d $MNT/1 \
-d $MNT/2 -d $MNT/3 \
-d $MNT/4 -d $MNT/5 \
-d $MNT/6 -d $MNT/7 \
-d $MNT/8 -d $MNT/9 \
-d $MNT/10 -d $MNT/11 \
-d $MNT/12 -d $MNT/13 \
-d $MNT/14 -d $MNT/15 \
| tee >(stats --trim-outliers | tail -1 1>&2)
sync
sudo umount /mnt/scratch
$

> >>The above was run without scsi-mq, and with using the deadline scheduler,
> >>results with CFQ are similary depressing for this test. So IO scheduling
> >>is in place for this test, it's not pure blk-mq without scheduling.
> >
> >virtio in guest, XFS direct IO -> no-op -> scsi in host.
>
> That has write back caching enabled on the guest, correct?

No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
Sorry for not being clear about that.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-01 00:56:13

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On Thu, Mar 31, 2016 at 10:21:04AM -0600, Jens Axboe wrote:
> On 03/31/2016 08:29 AM, Jens Axboe wrote:
> >>What I see in these performance dips is the XFS transaction
> >>subsystem stalling *completely* - instead of running at a steady
> >>state of around 350,000 transactions/s, there are *zero*
> >>transactions running for periods of up to ten seconds. This
> >>co-incides with the CPU usage falling to almost zero as well.
> >>AFAICT, the only thing that is running when the filesystem stalls
> >>like this is memory reclaim.
> >
> >I'll take a look at this, stalls should definitely not be occurring. How
> >much memory does the box have?
>
> I can't seem to reproduce this at all. On an nvme device, I get a
> fairly steady 60K/sec file creation rate, and we're nowhere near
> being IO bound. So the throttling has no effect at all.

That's too slow to show the stalls - you're likely concurrency bound
in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
agcount=32 so that every thread works in its own AG.

> On a raid0 on 4 flash devices, I get something that looks more IO
> bound, for some reason. Still no impact of the throttling, however.
> But given that your setup is this:
>
> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>
> we do potentially have two throttling points, which we don't want.
> Are both the guest and the host running the new code, or just the
> guest?

Just the guest. Host is running a 4.2.x kernel, IIRC.

> In any case, can I talk you into trying with two patches on top of
> the current code? It's the two newest patches here:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle
>
> The first treats REQ_META|REQ_PRIO like they should be treated, like
> high priority IO. The second disables throttling for virtual
> devices, so we only throttle on the backend. The latter should
> probably be the other way around, but we need some way of conveying
> that information to the backend.

I'm not changing the host kernels - it's a production machine and so
it runs long uptime testing of stable kernels. (e.g. catch slow
memory leaks, etc). So if you've disabled throttling in the guest, I
can't test the throttling changes.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-01 01:01:44

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On Thu, Mar 31, 2016 at 10:09:56PM +0000, Holger Hoffstätte wrote:
>
> Hi,
>
> Jens mentioned on Twitter I should post my experience here as well,
> so here we go.
>
> I've backported this series (incl. updates) to stable-4.4.x - not too
> difficult, minus the NVM part which I don't need anyway - and have been
> running it for the past few days without any problem whatsoever, with
> GREAT success.
>
> My use case is primarily larger amounts of stuff (transcoded movies,
> finished downloads, built Gentoo packages) that gets copied from tmpfs
> to SSD (or disk) and every time that happens, the system noticeably
> strangles readers (desktop, interactive shell). It does not really matter
> how I tune writeback via the write_expire/dirty_bytes knobs or the
> scheduler (and yes, I understand how they work); lowering the writeback
> limits helped a bit but the system is still overwhelmed. Jacking up
> deadline's writes_starved to unreasonable levels helps a bit, but in turn
> makes all writes suffer. Anything else - even tried BFQ for a while,
> which has its own unrelated problems - didn't really help either.

Can you go back to your original kernel, and lower nr_requests to 8?

Essentially all I see the block throttle doing is keeping the
request queue depth to somewhere between 8-12 requests, rather than
letting it blow out to near nr_requests (around 105-115), so it
would be interesting to note whether the block throttling has any
noticeable difference in behaviour when compared to just having a
very shallow request queue....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-01 03:25:53

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 03/31/2016 06:46 PM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 08:29:35AM -0600, Jens Axboe wrote:
>> On 03/31/2016 02:24 AM, Dave Chinner wrote:
>>> On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> This patchset isn't as much a final solution, as it's demonstration
>>>> of what I believe is a huge issue. Since the dawn of time, our
>>>> background buffered writeback has sucked. When we do background
>>>> buffered writeback, it should have little impact on foreground
>>>> activity. That's the definition of background activity... But for as
>>>> long as I can remember, heavy buffered writers has not behaved like
>>>> that. For instance, if I do something like this:
>>>>
>>>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>>>
>>>> on my laptop, and then try and start chrome, it basically won't start
>>>> before the buffered writeback is done. Or, for server oriented
>>>> workloads, where installation of a big RPM (or similar) adversely
>>>> impacts data base reads or sync writes. When that happens, I get people
>>>> yelling at me.
>>>>
>>>> Last time I posted this, I used flash storage as the example. But
>>>> this works equally well on rotating storage. Let's run a test case
>>>> that writes a lot. This test writes 50 files, each 100M, on XFS on
>>>> a regular hard drive. While this happens, we attempt to read
>>>> another file with fio.
>>>>
>>>> Writers:
>>>>
>>>> $ time (./write-files ; sync)
>>>> real 1m6.304s
>>>> user 0m0.020s
>>>> sys 0m12.210s
>>>
>>> Great. So a basic IO test looks good - let's throw something more
>>> complex at it. Say, a benchmark I've been using for years to stress
>>> the IO subsystem, the filesystem and memory reclaim all at the same
>>> time: a concurrent fsmark inode creation test.
>>> (first google hit https://lkml.org/lkml/2013/9/10/46)
>>
>> Is that how you are invoking it as well, with the same arguments?
>
> Yes. And the VM is exactly the same, too - 16p/16GB RAM. Cut down
> version of the script I use:
>
> #!/bin/bash
>
> QUOTA=
> MKFSOPTS=
> NFILES=100000
> DEV=/dev/vdc
> LOGBSIZE=256k
> FSMARK=/home/dave/src/fs_mark-3.3/fs_mark
> MNT=/mnt/scratch
>
> while [ $# -gt 0 ]; do
> case "$1" in
> -q) QUOTA="uquota,gquota,pquota" ;;
> -N) NFILES=$2 ; shift ;;
> -d) DEV=$2 ; shift ;;
> -l) LOGBSIZE=$2; shift ;;
> --) shift ; break ;;
> esac
> shift
> done
> MKFSOPTS="$MKFSOPTS $*"
>
> echo QUOTA=$QUOTA
> echo MKFSOPTS=$MKFSOPTS
> echo DEV=$DEV
>
> sudo umount $MNT > /dev/null 2>&1
> sudo mkfs.xfs -f $MKFSOPTS $DEV
> sudo mount -o nobarrier,logbsize=$LOGBSIZE,$QUOTA $DEV $MNT
> sudo chmod 777 $MNT
> sudo sh -c "echo 1 > /proc/sys/fs/xfs/stats_clear"
> time $FSMARK -D 10000 -S0 -n $NFILES -s 0 -L 32 \
> -d $MNT/0 -d $MNT/1 \
> -d $MNT/2 -d $MNT/3 \
> -d $MNT/4 -d $MNT/5 \
> -d $MNT/6 -d $MNT/7 \
> -d $MNT/8 -d $MNT/9 \
> -d $MNT/10 -d $MNT/11 \
> -d $MNT/12 -d $MNT/13 \
> -d $MNT/14 -d $MNT/15 \
> | tee >(stats --trim-outliers | tail -1 1>&2)
> sync
> sudo umount /mnt/scratch

Perfect, thanks!

>>>> The above was run without scsi-mq, and with using the deadline scheduler,
>>>> results with CFQ are similary depressing for this test. So IO scheduling
>>>> is in place for this test, it's not pure blk-mq without scheduling.
>>>
>>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>
>> That has write back caching enabled on the guest, correct?
>
> No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
> Sorry for not being clear about that.

That's fine, it's one less worry if that's not the case. So if you cat
the 'write_cache' file in the virtioblk sysfs block queue/ directory, it
says 'write through'? Just want to confirm that we got that propagated
correctly.


--
Jens Axboe

2016-04-01 03:29:38

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 03/31/2016 06:56 PM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 10:21:04AM -0600, Jens Axboe wrote:
>> On 03/31/2016 08:29 AM, Jens Axboe wrote:
>>>> What I see in these performance dips is the XFS transaction
>>>> subsystem stalling *completely* - instead of running at a steady
>>>> state of around 350,000 transactions/s, there are *zero*
>>>> transactions running for periods of up to ten seconds. This
>>>> co-incides with the CPU usage falling to almost zero as well.
>>>> AFAICT, the only thing that is running when the filesystem stalls
>>>> like this is memory reclaim.
>>>
>>> I'll take a look at this, stalls should definitely not be occurring. How
>>> much memory does the box have?
>>
>> I can't seem to reproduce this at all. On an nvme device, I get a
>> fairly steady 60K/sec file creation rate, and we're nowhere near
>> being IO bound. So the throttling has no effect at all.
>
> That's too slow to show the stalls - you're likely concurrency bound
> in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
> agcount=32 so that every thread works in its own AG.

That's the key, with that I get 300-400K ops/sec instead. I'll run some
testing with this tomorrow and see what I can find, it did one full run
now and I didn't see any issues, but I need to run it at various
settings and see if I can find the issue.

>> On a raid0 on 4 flash devices, I get something that looks more IO
>> bound, for some reason. Still no impact of the throttling, however.
>> But given that your setup is this:
>>
>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>
>> we do potentially have two throttling points, which we don't want.
>> Are both the guest and the host running the new code, or just the
>> guest?
>
> Just the guest. Host is running a 4.2.x kernel, IIRC.

OK

>> In any case, can I talk you into trying with two patches on top of
>> the current code? It's the two newest patches here:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=wb-buf-throttle
>>
>> The first treats REQ_META|REQ_PRIO like they should be treated, like
>> high priority IO. The second disables throttling for virtual
>> devices, so we only throttle on the backend. The latter should
>> probably be the other way around, but we need some way of conveying
>> that information to the backend.
>
> I'm not changing the host kernels - it's a production machine and so
> it runs long uptime testing of stable kernels. (e.g. catch slow
> memory leaks, etc). So if you've disabled throttling in the guest, I
> can't test the throttling changes.

Right, that'd definitely hide the problem for you. I'll see if I can get
it in a reproducible state and take it from there.

On your host, you said it's SCSI backed, but what does the device look like?

--
Jens Axboe

2016-04-01 03:34:04

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 03/31/2016 09:29 PM, Jens Axboe wrote:
>> I'm not changing the host kernels - it's a production machine and so
>> it runs long uptime testing of stable kernels. (e.g. catch slow
>> memory leaks, etc). So if you've disabled throttling in the guest, I
>> can't test the throttling changes.
>
> Right, that'd definitely hide the problem for you. I'll see if I can get
> it in a reproducible state and take it from there.

Though on the guest, if you could try with just this one applied:

http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-throttle&id=f21fb0e42c7347bd639a17341dcd3f72c1a30d29

I'd appreciate it. It won't disable the throttling in the guest, just
treat META and PRIO a bit differently.

--
Jens Axboe

2016-04-01 03:39:30

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 03/31/2016 09:29 PM, Jens Axboe wrote:
>>> I can't seem to reproduce this at all. On an nvme device, I get a
>>> fairly steady 60K/sec file creation rate, and we're nowhere near
>>> being IO bound. So the throttling has no effect at all.
>>
>> That's too slow to show the stalls - you're likely concurrency bound
>> in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
>> agcount=32 so that every thread works in its own AG.
>
> That's the key, with that I get 300-400K ops/sec instead. I'll run some
> testing with this tomorrow and see what I can find, it did one full run
> now and I didn't see any issues, but I need to run it at various
> settings and see if I can find the issue.

No stalls seen, I get the same performance with it disabled and with it
enabled, at both default settings, and lower ones (wb_percent=20).
Looking at iostat, we don't drive a lot of depth, so it makes sense,
even with the throttling we're doing essentially the same amount of IO.

What does 'nr_requests' say for your virtio_blk device? Looks like
virtio_blk has a queue_depth setting, but it's not set by default, and
then it uses the free entries in the ring. But I don't know what that is...

--
Jens Axboe

2016-04-01 05:04:29

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On Thu, Mar 31, 2016 at 09:29:30PM -0600, Jens Axboe wrote:
> On 03/31/2016 06:56 PM, Dave Chinner wrote:
> >I'm not changing the host kernels - it's a production machine and so
> >it runs long uptime testing of stable kernels. (e.g. catch slow
> >memory leaks, etc). So if you've disabled throttling in the guest, I
> >can't test the throttling changes.
>
> Right, that'd definitely hide the problem for you. I'll see if I can
> get it in a reproducible state and take it from there.
>
> On your host, you said it's SCSI backed, but what does the device look like?

HW RAID 0 w/ 1GB FBWC (dell h710, IIRC) of 2x200GB SATA SSDs
(actually 256GB, but 25% of each is left as spare, unused space).
Sustains about 35,000 random 4k write IOPS, up to 70k read IOPS.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-01 06:16:28

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:
> On 03/31/2016 09:29 PM, Jens Axboe wrote:
> >>>I can't seem to reproduce this at all. On an nvme device, I get a
> >>>fairly steady 60K/sec file creation rate, and we're nowhere near
> >>>being IO bound. So the throttling has no effect at all.
> >>
> >>That's too slow to show the stalls - you're likely concurrency bound
> >>in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
> >>agcount=32 so that every thread works in it's own AG.
> >
> >That's the key, with that I get 300-400K ops/sec instead. I'll run some
> >testing with this tomorrow and see what I can find, it did one full run
> >now and I didn't see any issues, but I need to run it at various
> >settings and see if I can find the issue.
>
> No stalls seen, I get the same performance with it disabled and with
> it enabled, at both default settings, and lower ones
> (wb_percent=20). Looking at iostat, we don't drive a lot of depth,
> so it makes sense, even with the throttling we're doing essentially
> the same amount of IO.

Try appending numa=fake=4 to your guest's kernel command line.

(that's what I'm using)

>
> What does 'nr_requests' say for your virtio_blk device? Looks like
> virtio_blk has a queue_depth setting, but it's not set by default,
> and then it uses the free entries in the ring. But I don't know what
> that is...

$ cat /sys/block/vdc/queue/nr_requests
128
$

Without the block throttling, guest IO (measured within the guest)
looks like this over a fair proportion of the test (5s sample time)

# iostat -d -x -m 5 /dev/vdc

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdc 0.00 20443.00 6.20 436.60 0.05 269.89 1248.48 73.83 146.11 486.58 141.27 1.64 72.40
vdc 0.00 11567.60 19.20 161.40 0.05 146.08 1657.12 119.17 704.57 707.25 704.25 5.34 96.48
vdc 0.00 12723.20 3.20 437.40 0.05 193.65 900.38 29.46 57.12 1.75 57.52 0.78 34.56
vdc 0.00 1739.80 22.40 426.80 0.05 123.62 563.86 23.44 62.51 79.89 61.59 1.01 45.28
vdc 0.00 12553.80 0.00 521.20 0.00 210.86 828.54 34.38 65.96 0.00 65.96 0.97 50.80
vdc 0.00 12523.60 25.60 529.60 0.10 201.94 745.29 52.24 77.73 0.41 81.47 1.14 63.20
vdc 0.00 5419.80 22.40 502.60 0.05 158.34 617.90 24.42 63.81 30.96 65.27 1.31 68.80
vdc 0.00 12059.00 0.00 439.60 0.00 174.85 814.59 30.91 70.27 0.00 70.27 0.72 31.76
vdc 0.00 7578.00 25.60 397.00 0.10 139.18 675.00 15.72 37.26 61.19 35.72 0.73 30.72
vdc 0.00 9156.00 0.00 537.40 0.00 173.57 661.45 17.08 29.62 0.00 29.62 0.53 28.72
vdc 0.00 5274.80 22.40 377.60 0.05 136.42 698.77 26.17 68.33 186.96 61.30 1.53 61.36
vdc 0.00 9407.00 3.20 541.00 0.05 174.28 656.05 36.10 66.33 3.00 66.71 0.87 47.60
vdc 0.00 8687.20 22.40 410.40 0.05 150.98 714.70 39.91 92.21 93.82 92.12 1.39 60.32
vdc 0.00 8872.80 0.00 422.60 0.00 139.28 674.96 25.01 33.03 0.00 33.03 0.91 38.40
vdc 0.00 1081.60 22.40 241.00 0.05 68.88 535.97 10.78 82.89 137.86 77.79 2.25 59.20
vdc 0.00 9826.80 0.00 445.00 0.00 167.42 770.49 45.16 101.49 0.00 101.49 1.80 79.92
vdc 0.00 7394.00 22.40 447.60 0.05 157.34 685.83 18.06 38.42 77.64 36.46 1.46 68.48
vdc 0.00 9984.80 3.20 252.00 0.05 108.46 870.82 85.68 293.73 16.75 297.24 3.00 76.64
vdc 0.00 0.00 22.40 454.20 0.05 117.67 505.86 8.11 39.51 35.71 39.70 1.17 55.76
vdc 0.00 10273.20 0.00 418.80 0.00 156.76 766.57 90.52 179.40 0.00 179.40 1.85 77.52
vdc 0.00 5650.00 22.40 185.00 0.05 84.12 831.20 103.90 575.15 60.82 637.42 4.21 87.36
vdc 0.00 7193.00 0.00 308.80 0.00 120.71 800.56 63.77 194.35 0.00 194.35 2.24 69.12
vdc 0.00 4460.80 9.80 211.00 0.03 69.52 645.07 72.35 154.81 269.39 149.49 4.42 97.60
vdc 0.00 683.00 14.00 374.60 0.05 99.13 522.69 25.38 167.61 603.14 151.33 1.45 56.24
vdc 0.00 7140.20 1.80 275.20 0.03 104.53 773.06 85.25 202.67 32.44 203.79 2.80 77.68
vdc 0.00 6916.00 0.00 164.00 0.00 82.59 1031.33 126.20 813.60 0.00 813.60 6.10 100.00
vdc 0.00 2255.60 22.40 359.00 0.05 107.41 577.06 42.97 170.03 92.79 174.85 2.17 82.64
vdc 0.00 7580.40 3.20 370.40 0.05 128.32 703.70 60.19 134.11 15.00 135.14 1.64 61.36
vdc 0.00 6438.40 18.80 159.20 0.04 78.04 898.36 126.80 706.27 639.15 714.19 5.62 100.00
vdc 0.00 5420.00 3.60 315.40 0.01 108.87 699.07 20.80 78.54 580.00 72.81 1.03 32.72
vdc 0.00 9444.00 2.60 242.40 0.00 118.72 992.38 126.21 488.66 146.15 492.33 4.08 100.00
vdc 0.00 0.00 19.80 434.60 0.05 110.14 496.65 12.74 57.56 313.78 45.89 1.10 49.84
vdc 0.00 14108.20 3.20 549.60 0.05 207.84 770.17 42.32 69.66 72.75 69.64 1.40 77.20
vdc 0.00 1306.40 35.20 268.20 0.08 78.74 532.08 30.84 114.22 175.07 106.24 2.02 61.20
vdc 0.00 14999.40 0.00 458.60 0.00 192.03 857.57 61.48 134.02 0.00 134.02 1.67 76.80
vdc 0.00 1.40 22.40 331.80 0.05 82.11 475.11 1.74 4.87 22.68 3.66 0.76 26.96
vdc 0.00 13971.80 0.00 670.20 0.00 248.26 758.63 34.45 51.37 0.00 51.37 1.04 69.52
vdc 0.00 7033.00 22.60 205.80 0.06 87.81 787.86 40.95 128.53 244.64 115.78 2.90 66.24
vdc 0.00 1282.00 3.20 456.00 0.05 123.21 549.74 14.56 46.99 21.00 47.17 1.42 65.20
vdc 0.00 9475.80 22.40 248.60 0.05 107.66 814.02 123.94 412.61 376.64 415.86 3.69 100.00
vdc 0.00 3603.60 0.00 418.80 0.00 133.32 651.94 71.28 210.08 0.00 210.08 1.77 74.00

You can see that there are periods where it drives the request queue
depth to congestion, but most of the time the device is only 60-70%
utilised and the queue depths are only 30-40 deep. There's quite a
lot of idle time in the request queue.

Note that there are a couple of points where merging stops
completely - that's when memory reclaim is directly flushing dirty
inodes because if all we have is cached inodes then we have to
throttle memory allocation back to the rate at which we can clean
dirty inodes.

Throughput does drop when this happens, but because the device has
idle overhead and spare request queue space, these less than optimal
IO dispatch spikes don't really affect throughput because the device
has the capacity available to soak them up without dropping
performance.


An equivalent trace from the middle of a run with block throttling
enabled:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdc 0.00 5143.40 22.40 188.00 0.05 81.04 789.38 19.17 89.09 237.86 71.37 4.75 99.92
vdc 0.00 7182.60 0.00 272.60 0.00 116.74 877.06 15.50 57.91 0.00 57.91 3.66 99.84
vdc 0.00 3732.80 11.60 102.20 0.01 53.17 957.05 19.35 151.47 514.00 110.32 8.79 100.00
vdc 0.00 0.00 10.80 1007.20 0.04 104.33 209.98 10.22 12.49 457.85 7.71 0.92 93.44
vdc 0.00 0.00 0.40 822.80 0.01 47.58 118.40 9.39 11.21 10.00 11.21 1.22 100.24
vdc 0.00 0.00 1.00 227.00 0.02 3.55 32.00 0.22 1.54 224.00 0.56 0.48 11.04
vdc 0.00 11100.40 3.20 437.40 0.05 125.91 585.49 7.95 17.77 47.25 17.55 1.56 68.56
vdc 0.00 14134.20 22.40 453.20 0.05 191.26 823.85 15.99 32.91 73.07 30.92 2.09 99.28
vdc 0.00 6667.00 0.00 265.20 0.00 105.11 811.70 15.71 58.98 0.00 58.98 3.77 100.00
vdc 0.00 6243.40 22.40 259.00 0.05 101.23 737.11 17.53 62.97 115.21 58.45 3.55 99.92
vdc 0.00 5590.80 0.00 278.20 0.00 105.30 775.18 18.09 65.55 0.00 65.55 3.59 100.00
vdc 0.00 0.00 14.20 714.80 0.02 97.81 274.86 11.61 12.86 260.85 7.93 1.23 89.44
vdc 0.00 0.00 9.80 1555.00 0.05 126.19 165.22 5.41 4.91 267.02 3.26 0.53 82.96
vdc 0.00 0.00 3.00 816.80 0.05 22.32 55.89 6.07 7.39 256.00 6.48 1.05 85.84
vdc 0.00 11172.80 0.20 463.00 0.00 125.77 556.10 6.13 13.23 260.00 13.13 0.93 43.28
vdc 0.00 9563.00 22.40 324.60 0.05 119.66 706.55 15.50 38.45 10.39 40.39 2.88 99.84
vdc 0.00 5333.60 0.00 218.00 0.00 83.71 786.46 15.57 80.46 0.00 80.46 4.59 100.00
vdc 0.00 5128.00 24.80 216.60 0.06 85.31 724.28 19.18 79.84 193.13 66.87 4.12 99.52
vdc 0.00 2746.40 0.00 257.40 0.00 81.13 645.49 11.16 43.70 0.00 43.70 3.87 99.68
vdc 0.00 0.00 0.00 418.80 0.00 104.68 511.92 5.33 12.74 0.00 12.74 1.93 80.96
vdc 0.00 8102.00 0.20 291.60 0.00 108.79 763.59 3.09 10.60 20.00 10.59 0.87 25.44


The first thing to note is the device utilisation is almost always
above 80%, and often at 100%, meaning with throttling the device
always has IO in flight. It's got no real idle time to soak up peaks
of IO activity - throttling means the device is running at close to
100% utilisation all the time under workloads like this, and there's
no elasticity in the pipeline to handle changes in IO dispatch
behaviour.

So when memory reclaim does direct inode writeback, we see merging
stop, but the request queue is not able to soak up all the IO being
dispatched, even though there is very little read IO demand. Hence
changes in the dispatch patterns that would drive deeper queues and
maintain performance will now get throttled, resulting in things
like memory reclaim backing up a lot and everything on the machine
suffering.

I'll try the "don't throttle REQ_META" patch, but this seems like a
fragile way to solve this problem - it shuts up the messenger, but
doesn't solve the problem for any other subsystem that might have a
similar issue. e.g. next we're going to have to make sure direct IO
(which is also REQ_WRITE dispatch) does not get throttled, and so
on....

It seems to me that the right thing to do here is add a separate
classification flag for IO that can be throttled. e.g. as
REQ_WRITEBACK and only background writeback work sets this flag.
That would ensure that when the IO is being dispatched from other
sources (e.g. fsync, sync_file_range(), direct IO, filesystem
metadata, etc) it is clear that it is not a target for throttling.
This would also allow us to easily switch off throttling if
writeback is occurring for memory reclaim reasons, and so on.
Throttling policy decisions belong above the block layer, even
though the throttle mechanism itself is in the block layer.

FWIW, this is analogous to REQ_READA, which tells the block layer
that a read is not important and can be discarded if there is too
much load. Policy is set at the layer that knows whether the IO can
be discarded safely, the mechanism is implemented at a lower layer
that knows about load, scheduling and other things the higher layers
know nothing about.
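
As a rough sketch of what I mean - REQ_WRITEBACK doesn't exist today,
and the flag bit and helper names below are purely illustrative:

#include <linux/bio.h>
#include <linux/fs.h>

/* Hypothetical flag, set only by background writeback at submission. */
#define REQ_WRITEBACK   (1UL << __REQ_NR_BITS)

/* The writeback code sets the policy... */
static void wb_submit_write(struct bio *bio, bool background)
{
        if (background)
                bio->bi_rw |= REQ_WRITEBACK;
        submit_bio(WRITE, bio);
}

/* ...and the block layer mechanism only throttles what is marked. */
static bool blk_wb_may_throttle(struct bio *bio)
{
        return (bio->bi_rw & REQ_WRITEBACK) != 0;
}

With something like that, fsync, direct IO and metadata writes never
even reach the throttle, and writeback done for memory reclaim could
simply not set the flag.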

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-01 06:28:03

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On Thu, Mar 31, 2016 at 09:25:33PM -0600, Jens Axboe wrote:
> On 03/31/2016 06:46 PM, Dave Chinner wrote:
> >>>virtio in guest, XFS direct IO -> no-op -> scsi in host.
> >>
> >>That has write back caching enabled on the guest, correct?
> >
> >No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
> >Sorry for not being clear about that.
>
> That's fine, it's one less worry if that's not the case. So if you
> cat the 'write_cache' file in the virtioblk sysfs block queue/
> directory, it says 'write through'? Just want to confirm that we got
> that propagated correctly.

No such file. But I did find:

$ cat /sys/block/vdc/cache_type
write back

Which is what I'd expect it to say given the man page description
of cache=none:

Note that this is considered a writeback mode and the guest
OS must handle the disk write cache correctly in order to
avoid data corruption on host crashes.

To make it say "write through" I need to use cache=directsync, but
I have no need for such integrity guarantees on a volatile test
device...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-01 14:33:25

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 04/01/2016 12:16 AM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 09:39:25PM -0600, Jens Axboe wrote:
>> On 03/31/2016 09:29 PM, Jens Axboe wrote:
>>>>> I can't seem to reproduce this at all. On an nvme device, I get a
>>>>> fairly steady 60K/sec file creation rate, and we're nowhere near
>>>>> being IO bound. So the throttling has no effect at all.
>>>>
>>>> That's too slow to show the stalls - you're likely concurrency bound
>>>> in allocation by the default AG count (4) from mkfs. Use mkfs.xfs -d
>>>> agcount=32 so that every thread works in its own AG.
>>>
>>> That's the key, with that I get 300-400K ops/sec instead. I'll run some
>>> testing with this tomorrow and see what I can find, it did one full run
>>> now and I didn't see any issues, but I need to run it at various
>>> settings and see if I can find the issue.
>>
>> No stalls seen, I get the same performance with it disabled and with
>> it enabled, at both default settings, and lower ones
>> (wb_percent=20). Looking at iostat, we don't drive a lot of depth,
>> so it makes sense, even with the throttling we're doing essentially
>> the same amount of IO.
>
> Try appending numa=fake=4 to your guest's kernel command line.
>
> (that's what I'm using)

Sure, I can give that a go.

>> What does 'nr_requests' say for your virtio_blk device? Looks like
>> virtio_blk has a queue_depth setting, but it's not set by default,
>> and then it uses the free entries in the ring. But I don't know what
>> that is...
>
> $ cat /sys/block/vdc/queue/nr_requests
> 128

OK, so that would put you in the 16/32/64 category for idle/normal/high
priority writeback. Which fits with the iostat below, which is in the
~16 range.
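
Roughly speaking, the limits scale with nr_requests along the lines of
the sketch below (illustrative only - the divisors are reverse-engineered
from the 16/32/64 numbers above, not lifted from the patch):

struct wb_depths {
        unsigned int idle;      /* background writeback, nothing waiting */
        unsigned int normal;    /* normal background writeback */
        unsigned int high;      /* tasks sleeping in balance_dirty_pages() */
};

static void wb_calc_depths(unsigned int nr_requests, struct wb_depths *d)
{
        d->idle   = nr_requests / 8;    /* 128 -> 16 */
        d->normal = nr_requests / 4;    /* 128 -> 32 */
        d->high   = nr_requests / 2;    /* 128 -> 64 */
}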

So the META thing should help, it'll bump it up a bit. But we're also
seeing smaller requests, and I think that could be because after we do
throttle, we could potentially have a merge candidate. The code doesn't
check post-sleeping, it'll allow any merges before though. That part
is a little harder to read from the iostat numbers, but there does
seem to be a correlation between your higher depths and bigger request
sizes.

> I'll try the "don't throttle REQ_META" patch, but this seems like a
> fragile way to solve this problem - it shuts up the messenger, but
> doesn't solve the problem for any other subsystem that might have a
> similar issue. e.g. next we're going to have to make sure direct IO
> (which is also REQ_WRITE dispatch) does not get throttled, and so
> on....

I don't think there's anything wrong with the REQ_META patch. Sure, we
could have better classifications (like discussed below), but that's
mainly tweaking. As long as we get the same answers, it's fine. There's
no throttling of O_DIRECT writes in the current code, it specifically
doesn't include those. It's only for the unbounded writes, which
writeback tends to be.

> It seems to me that the right thing to do here is add a separate
> classification flag for IO that can be throttled. e.g. as
> REQ_WRITEBACK and only background writeback work sets this flag.
> That would ensure that when the IO is being dispatched from other
> sources (e.g. fsync, sync_file_range(), direct IO, filesystem
> metadata, etc) it is clear that it is not a target for throttling.
> This would also allow us to easily switch off throttling if
> writeback is occurring for memory reclaim reasons, and so on.
> Throttling policy decisions belong above the block layer, even
> though the throttle mechanism itself is in the block layer.

We're already doing all of that, it just doesn't include a specific
REQ_WRITEBACK flag. And yeah, that would clean up the checking for
request type, but functionally it should be the same as it is now. It'll
be a bit more robust and easier to read if we just have a REQ_WRITEBACK;
right now it's WRITE_SYNC vs WRITE for important vs not-important, with
a check for write vs O_DIRECT write as well.


--
Jens Axboe

2016-04-01 14:34:47

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 04/01/2016 12:27 AM, Dave Chinner wrote:
> On Thu, Mar 31, 2016 at 09:25:33PM -0600, Jens Axboe wrote:
>> On 03/31/2016 06:46 PM, Dave Chinner wrote:
>>>>> virtio in guest, XFS direct IO -> no-op -> scsi in host.
>>>>
>>>> That has write back caching enabled on the guest, correct?
>>>
>>> No. It uses virtio,cache=none (that's the "XFS Direct IO" bit above).
>>> Sorry for not being clear about that.
>>
>> That's fine, it's one less worry if that's not the case. So if you
>> cat the 'write_cache' file in the virtioblk sysfs block queue/
>> directory, it says 'write through'? Just want to confirm that we got
>> that propagated correctly.
>
> No such file. But I did find:
>
> $ cat /sys/block/vdc/cache_type
> write back
>
> Which is what I'd expect it to say given the man page description
> of cache=none:
>
> Note that this is considered a writeback mode and the guest
> OS must handle the disk write cache correctly in order to
> avoid data corruption on host crashes.
>
> To make it say "write through" I need to use cache=directsync, but
> I have no need for such integrity guarantees on a volatile test
> device...

I wasn't as concerned about the integrity side, more about whether we
induce further throttling if it's flagged as write back. But I'll see if I can get
your test case reproduced, then I don't see why it can't get fixed. I'm
off all of next week though, so probably won't be until the week after...

--
Jens Axboe

2016-04-01 17:05:12

by Holger Hoffstätte

[permalink] [raw]
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck

On 04/01/16 03:01, Dave Chinner wrote:
> Can you go back to your original kernel, and lower nr_requests to 8?

Sure, did that and as expected it didn't help much. Under prolonged stress
it was actually even a bit worse than writeback throttling. IMHO that's not
really surprising either, since small queues now punish everyone and
in interactive mode I really want to e.g. start loading hundreds of small
thumbnails at once, or du a directory.

Instead of randomized aka manual/interactive testing I created a simple
stress tester:

#!/bin/sh
while true
do
cp bigfile bigfile.out
done

and running that in the background turns the system into a tar pit,
which is laughable when you consider that I have 24G and 8 cores.

With the writeback patchset and wb_percent=1 (yes, really!) it is almost
unnoticeable, but according to nmon still writes ~250-280 MB/s.
This is with deadline on ext4 on an older SATA-3 SSD that can still
do peak ~465 MB/s (with dd).

cheers,
Holger

2016-04-13 13:08:56

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages()

On Wed 30-03-16 09:07:52, Jens Axboe wrote:
> Note in the bdi_writeback structure if a task is currently being
> limited in balance_dirty_pages(), waiting for writeback to
> proceed.
...
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 11ff8f758631..15e696bc5d14 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1746,7 +1746,9 @@ pause:
> pause,
> start_time);
> __set_current_state(TASK_KILLABLE);
> + wb->dirty_sleeping = 1;
> io_schedule_timeout(pause);
> + wb->dirty_sleeping = 0;

Huh, but wb->dirty_sleeping is shared by all the processes in the system.
So this is seriously racy, isn't it? You rather need a counter for this to
work.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-04-13 14:20:19

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 4/9] writeback: track if we're sleeping on progress in balance_dirty_pages()

On 04/13/2016 07:08 AM, Jan Kara wrote:
> On Wed 30-03-16 09:07:52, Jens Axboe wrote:
>> Note in the bdi_writeback structure if a task is currently being
>> limited in balance_dirty_pages(), waiting for writeback to
>> proceed.
> ...
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 11ff8f758631..15e696bc5d14 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1746,7 +1746,9 @@ pause:
>> pause,
>> start_time);
>> __set_current_state(TASK_KILLABLE);
>> + wb->dirty_sleeping = 1;
>> io_schedule_timeout(pause);
>> + wb->dirty_sleeping = 0;
>
> Huh, but wb->dirty_sleeping is shared by all the processes in the system.
> So this is seriously racy, isn't it? You rather need a counter for this to
> work.

Sure, but it's not _that_ important. It's like wb->dirty_exceeded, we
have an equally relaxed relationship.

I don't mind making it more solid, but I can't make it a counter without
making it atomic. Which is why I left it as just a basic assignment. But
I guess since we only fiddle with it when going to sleep, we can make it
an atomic and not have to worry about the potential impact.
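
Something like this, sketch only and not the final patch:

/* In struct bdi_writeback, the flag becomes a counter: */
        atomic_t dirty_sleeping;        /* sleepers in balance_dirty_pages() */

/* And in balance_dirty_pages(), around the pause: */
        __set_current_state(TASK_KILLABLE);
        atomic_inc(&wb->dirty_sleeping);
        io_schedule_timeout(pause);
        atomic_dec(&wb->dirty_sleeping);

/* Readers then check atomic_read(&wb->dirty_sleeping) > 0 instead. */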

--
Jens Axboe