2016-03-22 17:55:32

by Jens Axboe

Subject: [PATCHSET][RFC] Make background writeback not suck

This patchset isn't as much a final solution as it's a demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or for server oriented workloads
where installation of a big RPM (or similar) adversely impacts data
base reads. When that happens, I get people yelling at me.

A quick demonstration - a fio job that reads a file, while someone
else issues the above 'dd'. Run on a flash device, using XFS. The
vmstat output looks something like this:

--io---- -system-- ------cpu-----
bi bo in cs us sy id wa st
156 4648 58 151 0 1 98 1 0
0 0 64 83 0 0 100 0 0
0 32 76 119 0 0 100 0 0
26616 0 7574 13907 7 0 91 2 0
41992 0 10811 21395 0 2 95 3 0
46040 0 11836 23395 0 3 94 3 0
19376 1310736 5894 10080 0 4 93 3 0
116 1974296 1858 455 0 4 93 3 0
124 2020372 1964 545 0 4 92 4 0
112 1678356 1955 620 0 3 93 3 0
8560 405508 3759 4756 0 1 96 3 0
42496 0 10798 21566 0 0 97 3 0
42476 0 10788 21524 0 0 97 3 0

The read starts out fine, but goes to shit when we start background
flushing. The reader experiences latency spikes in the seconds range.
On flash.

With this set of patches applied, the situation looks like this instead:

--io---- -system-- ------cpu-----
bi bo in cs us sy id wa st
33544 0 8650 17204 0 1 97 2 0
42488 0 10856 21756 0 0 97 3 0
42032 0 10719 21384 0 0 97 3 0
42544 12 10838 21631 0 0 97 3 0
42620 0 10982 21727 0 3 95 3 0
46392 0 11923 23597 0 3 94 3 0
36268 512000 9907 20044 0 3 91 5 0
31572 696324 8840 18248 0 1 91 7 0
30748 626692 8617 17636 0 2 91 6 0
31016 618504 8679 17736 0 3 91 6 0
30612 648196 8625 17624 0 3 91 6 0
30992 650296 8738 17859 0 3 91 6 0
30680 604075 8614 17605 0 3 92 6 0
30592 595040 8572 17564 0 2 92 6 0
31836 539656 8819 17962 0 2 92 5 0

and the reader never sees latency spikes above a few milliseconds.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. Currently there are tunables associated with this, see the
last patch for descriptions of those.

I welcome testing. The end goal here would be having much of this
auto-tuned, so that we don't lose substantial bandwidth for background
writes, while still maintaining decent non-wb performance and latencies.
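
If you want to reproduce the test case, something along these lines
should do it (the file names and fio job parameters here are just
examples, not the exact job file I used):

$ dd if=/dev/zero of=foo bs=1M count=10k &
$ fio --name=reader --filename=bar --rw=read --bs=4k --ioengine=psync \
      --size=1g --time_based --runtime=60

while watching 'vmstat 1' in another terminal.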

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches. Patches are
against current Linus' git, 4.5.0+.

block/Makefile | 2
block/blk-core.c | 21 +++
block/blk-lib.c | 1
block/blk-mq.c | 32 +++++
block/blk-settings.c | 11 +
block/blk-sysfs.c | 123 +++++++++++++++++++++
block/blk-wb.c | 219 +++++++++++++++++++++++++++++++++++++++
block/blk-wb.h | 24 ++++
block/cfq-iosched.c | 2
block/elevator.c | 6 -
drivers/nvme/host/core.c | 1
drivers/scsi/sd.c | 5
fs/fs-writeback.c | 5
include/linux/backing-dev-defs.h | 2
include/linux/blk_types.h | 2
include/linux/blkdev.h | 9 +
include/linux/elevator.h | 4
mm/page-writeback.c | 2
18 files changed, 456 insertions(+), 15 deletions(-)


--
Jens Axboe


2016-03-22 17:55:35

by Jens Axboe

Subject: [PATCH 1/6] block: ensure we don't truncate top bits of the request command flags

Some of the flags that we want to use from the make_request_fn path
are now larger than 32-bit, so change the functions involved to
accept a u64 instead of an unsigned int.

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-core.c | 7 ++++---
block/cfq-iosched.c | 2 +-
block/elevator.c | 6 +++---
include/linux/blkdev.h | 2 +-
include/linux/elevator.h | 4 ++--
5 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 827f8badd143..a9fe3d88af99 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1065,7 +1065,7 @@ static struct io_context *rq_ioc(struct bio *bio)
* Returns ERR_PTR on failure, with @q->queue_lock held.
* Returns request pointer on success, with @q->queue_lock *not held*.
*/
-static struct request *__get_request(struct request_list *rl, int rw_flags,
+static struct request *__get_request(struct request_list *rl, u64 rw_flags,
struct bio *bio, gfp_t gfp_mask)
{
struct request_queue *q = rl->q;
@@ -1237,7 +1237,7 @@ rq_starved:
* Returns ERR_PTR on failure, with @q->queue_lock held.
* Returns request pointer on success, with @q->queue_lock *not held*.
*/
-static struct request *get_request(struct request_queue *q, int rw_flags,
+static struct request *get_request(struct request_queue *q, u64 rw_flags,
struct bio *bio, gfp_t gfp_mask)
{
const bool is_sync = rw_is_sync(rw_flags) != 0;
@@ -1711,9 +1711,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
{
const bool sync = !!(bio->bi_rw & REQ_SYNC);
struct blk_plug *plug;
- int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
+ int el_ret, where = ELEVATOR_INSERT_SORT;
struct request *req;
unsigned int request_count = 0;
+ u64 rw_flags;

/*
* low level driver can indicate that it wants pages above a
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e3c591dd8f19..3e3b47b6a72f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -4285,7 +4285,7 @@ static inline int __cfq_may_queue(struct cfq_queue *cfqq)
return ELV_MQUEUE_MAY;
}

-static int cfq_may_queue(struct request_queue *q, int rw)
+static int cfq_may_queue(struct request_queue *q, u64 rw)
{
struct cfq_data *cfqd = q->elevator->elevator_data;
struct task_struct *tsk = current;
diff --git a/block/elevator.c b/block/elevator.c
index c3555c9c672f..5b9a615fd2df 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -352,7 +352,7 @@ void elv_dispatch_sort(struct request_queue *q, struct request *rq)
{
sector_t boundary;
struct list_head *entry;
- int stop_flags;
+ u64 stop_flags;

if (q->last_merge == rq)
q->last_merge = NULL;
@@ -511,7 +511,7 @@ void elv_merge_requests(struct request_queue *q, struct request *rq,
struct request *next)
{
struct elevator_queue *e = q->elevator;
- const int next_sorted = next->cmd_flags & REQ_SORTED;
+ const bool next_sorted = (next->cmd_flags & REQ_SORTED) != 0;

if (next_sorted && e->type->ops.elevator_merge_req_fn)
e->type->ops.elevator_merge_req_fn(q, rq, next);
@@ -717,7 +717,7 @@ void elv_put_request(struct request_queue *q, struct request *rq)
e->type->ops.elevator_put_req_fn(rq);
}

-int elv_may_queue(struct request_queue *q, int rw)
+int elv_may_queue(struct request_queue *q, u64 rw)
{
struct elevator_queue *e = q->elevator;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7e5d7e018bea..930bd4c5b7ff 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -615,7 +615,7 @@ static inline unsigned int blk_queue_cluster(struct request_queue *q)
/*
* We regard a request as sync, if either a read or a sync write
*/
-static inline bool rw_is_sync(unsigned int rw_flags)
+static inline bool rw_is_sync(u64 rw_flags)
{
return !(rw_flags & REQ_WRITE) || (rw_flags & REQ_SYNC);
}
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 638b324f0291..a06cca4d0f1a 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -26,7 +26,7 @@ typedef int (elevator_dispatch_fn) (struct request_queue *, int);
typedef void (elevator_add_req_fn) (struct request_queue *, struct request *);
typedef struct request *(elevator_request_list_fn) (struct request_queue *, struct request *);
typedef void (elevator_completed_req_fn) (struct request_queue *, struct request *);
-typedef int (elevator_may_queue_fn) (struct request_queue *, int);
+typedef int (elevator_may_queue_fn) (struct request_queue *, u64);

typedef void (elevator_init_icq_fn) (struct io_cq *);
typedef void (elevator_exit_icq_fn) (struct io_cq *);
@@ -134,7 +134,7 @@ extern struct request *elv_former_request(struct request_queue *, struct request
extern struct request *elv_latter_request(struct request_queue *, struct request *);
extern int elv_register_queue(struct request_queue *q);
extern void elv_unregister_queue(struct request_queue *q);
-extern int elv_may_queue(struct request_queue *, int);
+extern int elv_may_queue(struct request_queue *, u64);
extern void elv_completed_request(struct request_queue *, struct request *);
extern int elv_set_request(struct request_queue *q, struct request *rq,
struct bio *bio, gfp_t gfp_mask);
--
2.4.1.168.g1ea28e1

2016-03-22 17:55:42

by Jens Axboe

Subject: [PATCH 2/6] writeback: wb_start_writeback() should use WB_SYNC_ALL for WB_REASON_SYNC

If you call sync, the initial call to wakeup_flusher_threads() ends up
calling wb_start_writeback() with reason=WB_REASON_SYNC, but
wb_start_writeback() always uses WB_SYNC_NONE as the writeback mode.
Ensure that we use WB_SYNC_ALL for a sync operation.

Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5c46ed9f3e14..97a9e9987134 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -936,7 +936,10 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
return;
}

- work->sync_mode = WB_SYNC_NONE;
+ if (reason == WB_REASON_SYNC)
+ work->sync_mode = WB_SYNC_ALL;
+ else
+ work->sync_mode = WB_SYNC_NONE;
work->nr_pages = nr_pages;
work->range_cyclic = range_cyclic;
work->reason = reason;
--
2.4.1.168.g1ea28e1

2016-03-22 17:55:51

by Jens Axboe

Subject: [PATCH 6/6] writeback: throttle buffered writeback

Test patch that throttles buffered writeback to make it a lot
smoother, with way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has a heavy impact on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.

This would likely need dynamic adaptation to the current device; this
one has only been tested on NVMe. But it brings the background
activity impact down from 1-2s to tens of milliseconds instead.

This is just a test patch, and as such, it registers queue sysfs
entries to both monitor and tune the current state:

$ cat /sys/block/nvme0n1/queue/wb_stats
limit=4, batch=2, inflight=0, wait=0, timer=0

'limit' denotes how many requests we will allow inflight for buffered
writeback, and this can be changed by just writing an integer to this
file. 0 turns off the feature. 'inflight' shows how many requests
are currently inflight for buffered writeback, and 'wait' shows if
anyone is currently waiting for access.

The 'wb_depth' file sets the depth limit for background writeback.
If the device has write back caching, 'wb_cache_delay' delays by
this amount of usecs when a write completes before allowing more.
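
For example, to bump the background depth limit and the cache settle
time on a device, something like the below should work (device name
and values are just examples; the delay knob is registered as
'wb_cache_usecs' in sysfs):

$ echo 8 > /sys/block/nvme0n1/queue/wb_depth
$ echo 20000 > /sys/block/nvme0n1/queue/wb_cache_usecs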

Signed-off-by: Jens Axboe <[email protected]>
---
block/Makefile | 2 +-
block/blk-core.c | 14 +++
block/blk-lib.c | 1 +
block/blk-mq.c | 32 +++++-
block/blk-sysfs.c | 85 ++++++++++++++-
block/blk-wb.c | 219 +++++++++++++++++++++++++++++++++++++++
block/blk-wb.h | 24 +++++
include/linux/backing-dev-defs.h | 2 +
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 3 +
mm/page-writeback.c | 2 +
11 files changed, 382 insertions(+), 4 deletions(-)
create mode 100644 block/blk-wb.c
create mode 100644 block/blk-wb.h

diff --git a/block/Makefile b/block/Makefile
index 9eda2322b2d4..9df911a3b569 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
- blk-lib.o blk-mq.o blk-mq-tag.o \
+ blk-lib.o blk-mq.o blk-mq-tag.o blk-wb.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index a9fe3d88af99..24cc0481e3ca 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@

#include "blk.h"
#include "blk-mq.h"
+#include "blk-wb.h"

EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -848,6 +849,9 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
goto fail;

+ if (blk_buffered_writeback_init(q))
+ goto fail;
+
INIT_WORK(&q->timeout_work, blk_timeout_work);
q->request_fn = rfn;
q->prep_rq_fn = NULL;
@@ -880,6 +884,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,

fail:
blk_free_flush_queue(q->fq);
+ blk_buffered_writeback_exit(q);
return NULL;
}
EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1485,6 +1490,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
/* this is a bio leak */
WARN_ON(req->bio != NULL);

+ blk_buffered_writeback_done(q->rq_wb, req);
+
/*
* Request may not have originated from ll_rw_blk. if not,
* it didn't come out of our reserved rq pools
@@ -1715,6 +1722,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
struct request *req;
unsigned int request_count = 0;
u64 rw_flags;
+ bool wb_acct;

/*
* low level driver can indicate that it wants pages above a
@@ -1767,6 +1775,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
}

get_rq:
+ wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, q->queue_lock);
+
/*
* This sync check and mask will be re-done in init_request_from_bio(),
* but we need to set it earlier to expose the sync flag to the
@@ -1775,6 +1785,8 @@ get_rq:
rw_flags = bio_data_dir(bio);
if (sync)
rw_flags |= REQ_SYNC;
+ if (wb_acct)
+ rw_flags |= REQ_BUF_INFLIGHT;

/*
* Grab a free request. This is might sleep but can not fail.
@@ -1782,6 +1794,8 @@ get_rq:
*/
req = get_request(q, rw_flags, bio, GFP_NOIO);
if (IS_ERR(req)) {
+ if (wb_acct)
+ __blk_buffered_writeback_done(q->rq_wb);
bio->bi_error = PTR_ERR(req);
bio_endio(bio);
goto out_unlock;
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 9ebf65379556..b39d92361f30 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -6,6 +6,7 @@
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/scatterlist.h>
+#include <linux/atomic.h>

#include "blk.h"

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 050f7a13021b..55aace97fd35 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -29,6 +29,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-tag.h"
+#include "blk-wb.h"

static DEFINE_MUTEX(all_q_mutex);
static LIST_HEAD(all_q_list);
@@ -274,6 +275,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,

if (rq->cmd_flags & REQ_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);
+
+ blk_buffered_writeback_done(q->rq_wb, rq);
+
rq->cmd_flags = 0;

clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -1253,6 +1257,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
blk_qc_t cookie;
+ bool wb_acct;

blk_queue_bounce(q, &bio);

@@ -1270,9 +1275,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
} else
request_count = blk_plug_queued_count(q);

+ wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, NULL);
+
rq = blk_mq_map_request(q, bio, &data);
- if (unlikely(!rq))
+ if (unlikely(!rq)) {
+ if (wb_acct)
+ __blk_buffered_writeback_done(q->rq_wb);
return BLK_QC_T_NONE;
+ }
+
+ if (wb_acct)
+ rq->cmd_flags |= REQ_BUF_INFLIGHT;

cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);

@@ -1349,6 +1362,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
struct blk_map_ctx data;
struct request *rq;
blk_qc_t cookie;
+ bool wb_acct;

blk_queue_bounce(q, &bio);

@@ -1363,9 +1377,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
blk_attempt_plug_merge(q, bio, &request_count, NULL))
return BLK_QC_T_NONE;

+ wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, NULL);
+
rq = blk_mq_map_request(q, bio, &data);
- if (unlikely(!rq))
+ if (unlikely(!rq)) {
+ if (wb_acct)
+ __blk_buffered_writeback_done(q->rq_wb);
return BLK_QC_T_NONE;
+ }
+
+ if (wb_acct)
+ rq->cmd_flags |= REQ_BUF_INFLIGHT;

cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);

@@ -2018,6 +2040,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
/* mark the queue as mq asap */
q->mq_ops = set->ops;

+ if (blk_buffered_writeback_init(q))
+ return ERR_PTR(-ENOMEM);
+
q->queue_ctx = alloc_percpu(struct blk_mq_ctx);
if (!q->queue_ctx)
return ERR_PTR(-ENOMEM);
@@ -2084,6 +2109,7 @@ err_map:
kfree(q->queue_hw_ctx);
err_percpu:
free_percpu(q->queue_ctx);
+ blk_buffered_writeback_exit(q);
return ERR_PTR(-ENOMEM);
}
EXPORT_SYMBOL(blk_mq_init_allocated_queue);
@@ -2096,6 +2122,8 @@ void blk_mq_free_queue(struct request_queue *q)
list_del_init(&q->all_q_node);
mutex_unlock(&all_q_mutex);

+ blk_buffered_writeback_exit(q);
+
blk_mq_del_queue_tag_set(q);

blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index deb2270bf1f3..da5e0517b9af 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -13,6 +13,7 @@

#include "blk.h"
#include "blk-mq.h"
+#include "blk-wb.h"

struct queue_sysfs_entry {
struct attribute attr;
@@ -358,7 +359,6 @@ static ssize_t queue_wc_show(struct request_queue *q, char *page)
static ssize_t queue_wc_store(struct request_queue *q, const char *page,
size_t count)
{
- ssize_t ret;
int set = -1;

if (!strncmp(page, "write back", 10))
@@ -377,6 +377,71 @@ static ssize_t queue_wc_store(struct request_queue *q, const char *page,
queue_flag_clear(QUEUE_FLAG_WC, q);
spin_unlock_irq(q->queue_lock);

+ return count;
+}
+
+static ssize_t queue_wb_stats_show(struct request_queue *q, char *page)
+{
+ struct rq_wb *wb = q->rq_wb;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return sprintf(page, "limit=%d, batch=%d, inflight=%d, wait=%d, timer=%d\n",
+ wb->limit, wb->batch, atomic_read(&wb->inflight),
+ waitqueue_active(&wb->wait), timer_pending(&wb->timer));
+}
+
+static ssize_t queue_wb_depth_show(struct request_queue *q, char *page)
+{
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return queue_var_show(q->rq_wb->limit, page);
+}
+
+static ssize_t queue_wb_depth_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long var;
+ ssize_t ret;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ ret = queue_var_store(&var, page, count);
+ if (ret < 0)
+ return ret;
+ if (var != (unsigned int) var)
+ return -EINVAL;
+
+ blk_update_wb_limit(q->rq_wb, var);
+ return ret;
+}
+
+static ssize_t queue_wb_cache_delay_show(struct request_queue *q, char *page)
+{
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return queue_var_show(q->rq_wb->cache_delay_usecs, page);
+}
+
+static ssize_t queue_wb_cache_delay_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long var;
+ ssize_t ret;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ ret = queue_var_store(&var, page, count);
+ if (ret < 0)
+ return ret;
+
+ q->rq_wb->cache_delay_usecs = var;
+ q->rq_wb->cache_delay = usecs_to_jiffies(var);
return ret;
}

@@ -517,6 +582,21 @@ static struct queue_sysfs_entry queue_wc_entry = {
.store = queue_wc_store,
};

+static struct queue_sysfs_entry queue_wb_stats_entry = {
+ .attr = {.name = "wb_stats", .mode = S_IRUGO },
+ .show = queue_wb_stats_show,
+};
+static struct queue_sysfs_entry queue_wb_cache_delay_entry = {
+ .attr = {.name = "wb_cache_usecs", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wb_cache_delay_show,
+ .store = queue_wb_cache_delay_store,
+};
+static struct queue_sysfs_entry queue_wb_depth_entry = {
+ .attr = {.name = "wb_depth", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wb_depth_show,
+ .store = queue_wb_depth_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -543,6 +623,9 @@ static struct attribute *default_attrs[] = {
&queue_random_entry.attr,
&queue_poll_entry.attr,
&queue_wc_entry.attr,
+ &queue_wb_stats_entry.attr,
+ &queue_wb_cache_delay_entry.attr,
+ &queue_wb_depth_entry.attr,
NULL,
};

diff --git a/block/blk-wb.c b/block/blk-wb.c
new file mode 100644
index 000000000000..2aa3753a8e1e
--- /dev/null
+++ b/block/blk-wb.c
@@ -0,0 +1,219 @@
+/*
+ * buffered writeback throttling
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ * Things that need changing:
+ *
+ * - Auto-detection of most of this, no tunables. Cache type we can get,
+ * and most other settings we can tweak/gather based on time.
+ * - Better solution for rwb->bdp_wait?
+ * - Higher depth for WB_SYNC_ALL?
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+
+#include "blk.h"
+#include "blk-wb.h"
+
+void __blk_buffered_writeback_done(struct rq_wb *rwb)
+{
+ int inflight;
+
+ inflight = atomic_dec_return(&rwb->inflight);
+ if (inflight >= rwb->limit)
+ return;
+
+ /*
+ * If the device does caching, we can still flood it with IO
+ * even at a low depth. If caching is on, delay a bit before
+ * submitting the next, if we're still purely background
+ * activity.
+ */
+ if (test_bit(QUEUE_FLAG_WC, &rwb->q->queue_flags) && !*rwb->bdp_wait &&
+ time_before(jiffies, rwb->last_comp + rwb->cache_delay)) {
+ if (!timer_pending(&rwb->timer))
+ mod_timer(&rwb->timer, jiffies + rwb->cache_delay);
+ return;
+ }
+
+ if (waitqueue_active(&rwb->wait)) {
+ int diff = rwb->limit - inflight;
+
+ if (diff >= rwb->batch)
+ wake_up_nr(&rwb->wait, 1);
+ }
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, when the request gets freed.
+ */
+void blk_buffered_writeback_done(struct rq_wb *rwb, struct request *rq)
+{
+ if (!(rq->cmd_flags & REQ_BUF_INFLIGHT)) {
+ const unsigned long cur = jiffies;
+
+ if (rwb->limit && cur != rwb->last_comp)
+ rwb->last_comp = cur;
+ } else
+ __blk_buffered_writeback_done(rwb);
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+ int cur = atomic_read(v);
+
+ for (;;) {
+ int old;
+
+ if (cur >= below)
+ return false;
+ old = atomic_cmpxchg(v, cur, cur + 1);
+ if (old == cur)
+ break;
+ cur = old;
+ }
+
+ return true;
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __blk_buffered_writeback_wait(struct rq_wb *rwb, unsigned int limit,
+ spinlock_t *lock)
+{
+ DEFINE_WAIT(wait);
+
+ if (!timer_pending(&rwb->timer) &&
+ atomic_inc_below(&rwb->inflight, limit))
+ return;
+
+ do {
+ prepare_to_wait_exclusive(&rwb->wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ if (!timer_pending(&rwb->timer) &&
+ atomic_inc_below(&rwb->inflight, limit))
+ break;
+
+ if (lock)
+ spin_unlock_irq(lock);
+
+ io_schedule();
+
+ if (lock)
+ spin_lock_irq(lock);
+ } while (1);
+
+ finish_wait(&rwb->wait, &wait);
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+bool blk_buffered_writeback_wait(struct rq_wb *rwb, struct bio *bio,
+ spinlock_t *lock)
+{
+ unsigned int limit;
+
+ /*
+ * If disabled, or not a WRITE (or a discard), do nothing
+ */
+ if (!rwb->limit || !(bio->bi_rw & REQ_WRITE) ||
+ (bio->bi_rw & REQ_DISCARD))
+ return false;
+
+ /*
+ * Don't throttle WRITE_ODIRECT
+ */
+ if ((bio->bi_rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC)
+ return false;
+
+ /*
+ * At this point we know it's a buffered write. If REQ_SYNC is
+ * set, then it's WB_SYNC_ALL writeback. Bump the limit 4x for
+ * those, since someone is (or will be) waiting on that.
+ */
+ limit = rwb->limit;
+ if (bio->bi_rw & REQ_SYNC)
+ limit <<= 2;
+ else if (limit != 1) {
+ /*
+ * If less than 100ms since we completed unrelated IO,
+ * limit us to a depth of 1 for background writeback.
+ */
+ if (time_before(jiffies, rwb->last_comp + HZ / 10))
+ limit = 1;
+ else if (!*rwb->bdp_wait)
+ limit >>= 1;
+ }
+
+ __blk_buffered_writeback_wait(rwb, limit, lock);
+ return true;
+}
+
+void blk_update_wb_limit(struct rq_wb *rwb, unsigned int limit)
+{
+ rwb->limit = limit;
+ rwb->batch = rwb->limit / 2;
+ if (!rwb->batch && rwb->limit)
+ rwb->batch = 1;
+ else if (rwb->batch > 4)
+ rwb->batch = 4;
+
+ wake_up_all(&rwb->wait);
+}
+
+static void blk_buffered_writeback_timer(unsigned long data)
+{
+ struct rq_wb *rwb = (struct rq_wb *) data;
+
+ if (waitqueue_active(&rwb->wait))
+ wake_up_nr(&rwb->wait, 1);
+}
+
+#define DEF_WB_LIMIT 4
+#define DEF_WB_CACHE_DELAY 10000
+
+int blk_buffered_writeback_init(struct request_queue *q)
+{
+ struct rq_wb *rwb;
+
+ rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+ if (!rwb)
+ return -ENOMEM;
+
+ atomic_set(&rwb->inflight, 0);
+ init_waitqueue_head(&rwb->wait);
+ rwb->last_comp = jiffies;
+ rwb->bdp_wait = &q->backing_dev_info.wb.dirty_sleeping;
+ setup_timer(&rwb->timer, blk_buffered_writeback_timer,
+ (unsigned long) rwb);
+ rwb->cache_delay_usecs = DEF_WB_CACHE_DELAY;
+ rwb->cache_delay = usecs_to_jiffies(rwb->cache_delay);
+ rwb->q = q;
+ blk_update_wb_limit(rwb, DEF_WB_LIMIT);
+ q->rq_wb = rwb;
+ return 0;
+}
+
+void blk_buffered_writeback_exit(struct request_queue *q)
+{
+ if (q->rq_wb)
+ del_timer_sync(&q->rq_wb->timer);
+
+ kfree(q->rq_wb);
+ q->rq_wb = NULL;
+}
diff --git a/block/blk-wb.h b/block/blk-wb.h
new file mode 100644
index 000000000000..f9d5dc817c80
--- /dev/null
+++ b/block/blk-wb.h
@@ -0,0 +1,24 @@
+#ifndef BLK_WB_H
+#define BLK_WB_H
+
+struct rq_wb {
+ unsigned int limit;
+ unsigned int batch;
+ unsigned int cache_delay;
+ unsigned int cache_delay_usecs;
+ unsigned long last_comp;
+ unsigned int *bdp_wait;
+ struct request_queue *q;
+ atomic_t inflight;
+ wait_queue_head_t wait;
+ struct timer_list timer;
+};
+
+void __blk_buffered_writeback_done(struct rq_wb *);
+void blk_buffered_writeback_done(struct rq_wb *, struct request *);
+bool blk_buffered_writeback_wait(struct rq_wb *, struct bio *, spinlock_t *);
+int blk_buffered_writeback_init(struct request_queue *);
+void blk_buffered_writeback_exit(struct request_queue *);
+void blk_update_wb_limit(struct rq_wb *, unsigned int);
+
+#endif
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1b4d69f68c33..f702309216b4 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
struct list_head work_list;
struct delayed_work dwork; /* work item used for writeback */

+ int dirty_sleeping; /* waiting on dirty limit exceeded */
+
struct list_head bdi_node; /* anchored at bdi->wb_list */

#ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86a38ea1823f..6f2a174b771c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -188,6 +188,7 @@ enum rq_flag_bits {
__REQ_PM, /* runtime pm request */
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
+ __REQ_BUF_INFLIGHT, /* track inflight for buffered */
__REQ_NR_BITS, /* stops here */
};

@@ -241,6 +242,7 @@ enum rq_flag_bits {
#define REQ_PM (1ULL << __REQ_PM)
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)
+#define REQ_BUF_INFLIGHT (1ULL << __REQ_BUF_INFLIGHT)

typedef unsigned int blk_qc_t;
#define BLK_QC_T_NONE -1U
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index da5e85c35318..b33e16c4f570 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -37,6 +37,7 @@ struct bsg_job;
struct blkcg_gq;
struct blk_flush_queue;
struct pr_ops;
+struct rq_wb;

#define BLKDEV_MIN_RQ 4
#define BLKDEV_MAX_RQ 128 /* Default maximum */
@@ -290,6 +291,8 @@ struct request_queue {
int nr_rqs[2]; /* # allocated [a]sync rqs */
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */

+ struct rq_wb *rq_wb;
+
/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
* is used, root blkg allocates from @q->root_rl and all other
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..15e696bc5d14 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1746,7 +1746,9 @@ pause:
pause,
start_time);
__set_current_state(TASK_KILLABLE);
+ wb->dirty_sleeping = 1;
io_schedule_timeout(pause);
+ wb->dirty_sleeping = 0;

current->dirty_paused_when = now + pause;
current->nr_dirtied = 0;
--
2.4.1.168.g1ea28e1

2016-03-22 17:55:56

by Jens Axboe

Subject: [PATCH 5/6] NVMe: inform block layer of write cache state

This isn't quite correct, since the VWC merely states whether a potential
write back cache is volatile or not. But for the purpose of write
absorption, it's good enough.

Signed-off-by: Jens Axboe <[email protected]>
---
drivers/nvme/host/core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 643f457131c2..05c8edfb7611 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -906,6 +906,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
blk_queue_virt_boundary(q, ctrl->page_size - 1);
+ blk_queue_write_cache(q, ctrl->vwc & NVME_CTRL_VWC_PRESENT);
}

/*
--
2.4.1.168.g1ea28e1

2016-03-22 17:56:40

by Jens Axboe

Subject: [PATCH 4/6] sd: inform block layer of write cache state

Signed-off-by: Jens Axboe <[email protected]>
---
drivers/scsi/sd.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5a5457ac9cdb..049f424fb4ad 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -192,6 +192,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
sdkp->WCE = wce;
sdkp->RCD = rcd;
sd_set_flush_flag(sdkp);
+ blk_queue_write_cache(sdp->request_queue, wce != 0);
return count;
}

@@ -2571,7 +2572,7 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
sdkp->DPOFUA ? "supports DPO and FUA"
: "doesn't support DPO or FUA");

- return;
+ goto done;
}

bad_sense:
@@ -2596,6 +2597,8 @@ defaults:
}
sdkp->RCD = 0;
sdkp->DPOFUA = 0;
+done:
+ blk_queue_write_cache(sdp->request_queue, sdkp->WCE != 0);
}

/*
--
2.4.1.168.g1ea28e1

2016-03-22 17:56:56

by Jens Axboe

Subject: [PATCH 3/6] block: add ability to flag write back caching on a device

Add an internal helper and flag for setting whether a queue has
write back caching, or write through (or none). Add a sysfs file
to show this as well, and make it changeable from user space.
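
For example (device name is just an example, and what gets reported
depends on the device):

$ cat /sys/block/sda/queue/write_cache
write back
$ echo "write through" > /sys/block/sda/queue/write_cache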

Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-settings.c | 11 +++++++++++
block/blk-sysfs.c | 40 ++++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 4 ++++
3 files changed, 55 insertions(+)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index c7bb666aafd1..4dbd511a9889 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -846,6 +846,17 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
}
EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);

+void blk_queue_write_cache(struct request_queue *q, bool enabled)
+{
+ spin_lock_irq(q->queue_lock);
+ if (enabled)
+ queue_flag_set(QUEUE_FLAG_WC, q);
+ else
+ queue_flag_clear(QUEUE_FLAG_WC, q);
+ spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL_GPL(blk_queue_write_cache);
+
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index dd93763057ce..deb2270bf1f3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -347,6 +347,39 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
return ret;
}

+static ssize_t queue_wc_show(struct request_queue *q, char *page)
+{
+ if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+ return sprintf(page, "write back\n");
+
+ return sprintf(page, "write through\n");
+}
+
+static ssize_t queue_wc_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ ssize_t ret;
+ int set = -1;
+
+ if (!strncmp(page, "write back", 10))
+ set = 1;
+ else if (!strncmp(page, "write through", 13) ||
+ !strncmp(page, "none", 4))
+ set = 0;
+
+ if (set == -1)
+ return -EINVAL;
+
+ spin_lock_irq(q->queue_lock);
+ if (set)
+ queue_flag_set(QUEUE_FLAG_WC, q);
+ else
+ queue_flag_clear(QUEUE_FLAG_WC, q);
+ spin_unlock_irq(q->queue_lock);
+
+ return ret;
+}
+
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -478,6 +511,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
.store = queue_poll_store,
};

+static struct queue_sysfs_entry queue_wc_entry = {
+ .attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wc_show,
+ .store = queue_wc_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -503,6 +542,7 @@ static struct attribute *default_attrs[] = {
&queue_iostats_entry.attr,
&queue_random_entry.attr,
&queue_poll_entry.attr,
+ &queue_wc_entry.attr,
NULL,
};

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 930bd4c5b7ff..da5e85c35318 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -491,15 +491,18 @@ struct request_queue {
#define QUEUE_FLAG_INIT_DONE 20 /* queue is initialized */
#define QUEUE_FLAG_NO_SG_MERGE 21 /* don't attempt to merge SG segments*/
#define QUEUE_FLAG_POLL 22 /* IO polling enabled if set */
+#define QUEUE_FLAG_WC 23 /* Write back caching */

#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
+ (1 << QUEUE_FLAG_WC) | \
(1 << QUEUE_FLAG_ADD_RANDOM))

#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
+ (1 << QUEUE_FLAG_WC) | \
(1 << QUEUE_FLAG_POLL))

static inline void queue_lockdep_assert_held(struct request_queue *q)
@@ -1009,6 +1012,7 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
+extern void blk_queue_write_cache(struct request_queue *q, bool enabled);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);

extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
--
2.4.1.168.g1ea28e1

2016-03-22 18:57:29

by Christoph Hellwig

Subject: Re: [PATCH 3/6] block: add ability to flag write back caching on a device

On Tue, Mar 22, 2016 at 11:55:17AM -0600, Jens Axboe wrote:
> Add an internal helper and flag for setting whether a queue has
> write back caching, or write through (or none). Add a sysfs file
> to show this as well, and make it changeable from user space.

We do this by passing the REQ_FLUSH flag to blk_queue_flush today.
While that's not a great interface, adding a second one doesn't make it
any better :)

2016-03-22 18:59:06

by Christoph Hellwig

Subject: Re: [PATCH 1/6] block: ensure we don't truncate top bits of the request command flags

On Tue, Mar 22, 2016 at 11:55:15AM -0600, Jens Axboe wrote:
> Some of the flags that we want to use from the make_request_fn path
> are now larger than 32-bit, so change the functions involved to
> accept a u64 instead of an unsigned int.

When did we start doing that? We really should merge Mike's split
of the operation style flags into the cmd_type before making things
even worse in the flags area.

2016-03-22 19:00:20

by Jens Axboe

Subject: Re: [PATCH 3/6] block: add ability to flag write back caching on a device

On 03/22/2016 12:57 PM, Christoph Hellwig wrote:
> On Tue, Mar 22, 2016 at 11:55:17AM -0600, Jens Axboe wrote:
>> Add an internal helper and flag for setting whether a queue has
>> write back caching, or write through (or none). Add a sysfs file
>> to show this as well, and make it changeable from user space.
>
> We do this by passing the REQ_FLUSH flag to blk_queue_flush today.
> While that's not a great interface, adding a second one doesn't make it
> any better :)

I think the newer one is cleaner, so it would make more sense to put the
flush part on top.


--
Jens Axboe

2016-03-22 19:01:55

by Jens Axboe

Subject: Re: [PATCH 1/6] block: ensure we don't truncate top bits of the request command flags

On 03/22/2016 12:59 PM, Christoph Hellwig wrote:
> On Tue, Mar 22, 2016 at 11:55:15AM -0600, Jens Axboe wrote:
>> Some of the flags that we want to use from the make_request_fn path
>> are now larger than 32-bit, so change the functions involved to
>> accept a u64 instead of an unsigned int.
>
> When did we start doing that? We really should merge Mike's split
> of the operation style flags into the cmd_type before making things
> even worse in the flags area.

Just now, and I ran into it last week as well, for a test patch on cfq
that passed in higher flags for get_request -> may_queue() as well. We
can do Mike's split first, I think it's a good cleanup. As a standalone
series, I needed it though.

--
Jens Axboe

2016-03-22 20:12:07

by Jeff Moyer

Subject: Re: [PATCH 6/6] writeback: throttle buffered writeback

Hi, Jens,

Jens Axboe <[email protected]> writes:

> If the device has write back caching, 'wb_cache_delay' delays by
> this amount of usecs when a write completes before allowing more.

What's the reason behind that?

Thanks!
Jeff

2016-03-22 20:19:41

by Jens Axboe

Subject: Re: [PATCH 6/6] writeback: throttle buffered writeback

On 03/22/2016 02:12 PM, Jeff Moyer wrote:
> Hi, Jens,
>
> Jens Axboe <[email protected]> writes:
>
>> If the device has write back caching, 'wb_cache_delay' delays by
>> this amount of usecs when a write completes before allowing more.
>
> What's the reason behind that?

For classic write back caching, the cache can absorb a bunch of writes
shortly, which means that the completion cost only shows a small part of
the overall cost. This means that if we just throttle on completion,
then when the device starts committing to media, then we'll end up
starving other IO anyway. This knob is a way to attempt to tame that.

--
Jens Axboe

2016-03-22 20:27:34

by Jeff Moyer

Subject: Re: [PATCH 6/6] writeback: throttle buffered writeback

Jens Axboe <[email protected]> writes:

> On 03/22/2016 02:12 PM, Jeff Moyer wrote:
>> Hi, Jens,
>>
>> Jens Axboe <[email protected]> writes:
>>
>>> If the device has write back caching, 'wb_cache_delay' delays by
>>> this amount of usecs when a write completes before allowing more.
>>
>> What's the reason behind that?
>
> For classic write back caching, the cache can absorb a bunch of writes
> shortly, which means that the completion cost only shows a small part
> of the overall cost. This means that if we just throttle on
> completion, then when the device starts committing to media, then
> we'll end up starving other IO anyway. This knob is a way to attempt
> to tame that.

Ah, makes perfect sense. Thanks!

-Jeff

2016-03-22 21:30:06

by Shaohua Li

Subject: Re: [PATCH 6/6] writeback: throttle buffered writeback

On Tue, Mar 22, 2016 at 02:19:28PM -0600, Jens Axboe wrote:
> On 03/22/2016 02:12 PM, Jeff Moyer wrote:
> >Hi, Jens,
> >
> >Jens Axboe <[email protected]> writes:
> >
> >>If the device has write back caching, 'wb_cache_delay' delays by
> >>this amount of usecs when a write completes before allowing more.
> >
> >What's the reason behind that?
>
> For classic write back caching, the cache can absorb a bunch of writes
> shortly, which means that the completion cost only shows a small part of the
> overall cost. This means that if we just throttle on completion, then when
> the device starts committing to media, then we'll end up starving other IO
> anyway. This knob is a way to attempt to tame that.

Does request size matter? I think it does. If request size is accounted
for, there will be the issue of how to evaluate the IO cost of each
request, which is hard.

Looks like the throttling is done regardless of whether there is other IO
running, which could hurt writeback.

Thanks,
Shaohua

2016-03-22 21:35:08

by Dave Chinner

Subject: Re: [PATCH 2/6] writeback: wb_start_writeback() should use WB_SYNC_ALL for WB_REASON_SYNC

On Tue, Mar 22, 2016 at 11:55:16AM -0600, Jens Axboe wrote:
> If you call sync, the initial call to wakeup_flusher_threads() ends up
> calling wb_start_writeback() with reason=WB_REASON_SYNC, but
> wb_start_writeback() always uses WB_SYNC_NONE as the writeback mode.
> Ensure that we use WB_SYNC_ALL for a sync operation.

This seems wrong to me. We want background write to happen as
quickly as possible and /not block/ when we first kick sync.

The latter blocking passes of sync use WB_SYNC_ALL to ensure that we
block waiting for all remaining IO to be issued and waited on, but
the background writeback doesn't need to do this.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-22 21:35:47

by Jens Axboe

Subject: Re: [PATCH 6/6] writeback: throttle buffered writeback

On 03/22/2016 03:30 PM, Shaohua Li wrote:
> On Tue, Mar 22, 2016 at 02:19:28PM -0600, Jens Axboe wrote:
>> On 03/22/2016 02:12 PM, Jeff Moyer wrote:
>>> Hi, Jens,
>>>
>>> Jens Axboe <[email protected]> writes:
>>>
>>>> If the device has write back caching, 'wb_cache_delay' delays by
>>>> this amount of usecs when a write completes before allowing more.
>>>
>>> What's the reason behind that?
>>
>> For classic write back caching, the cache can absorb a bunch of writes
>> shortly, which means that the completion cost only shows a small part of the
>> overall cost. This means that if we just throttle on completion, then when
>> the device starts committing to media, then we'll end up starving other IO
>> anyway. This knob is a way to attempt to tame that.
>
> Does request size matter? I think it does. If request size is accounted
> for, there will be the issue of how to evaluate the IO cost of each
> request, which is hard.

The code currently deliberately ignores it, since we do the throttling
checks post merging. We can experiment with doing it on a per-request
basis. I didn't want to complicate it too much; in my testing, for this
sort of application, the size of the request doesn't matter too much.
That's mainly because we, by default, bound the size. If it was
unbounded, then that would be different.

> Looks like the throttling is done regardless of whether there is other IO
> running, which could hurt writeback.

I wanted to make the first cut very tough on the writes. We always want
to throttle, but perhaps not as much as we do now. But you'd be
surprised how close this basic low depth gets to ideal performance, on
most devices!

Background writeback does not have to be at 100% or 99% of the device
capability. If we sync or wait on it, then yes, we want it to go really
fast. And it should.

--
Jens Axboe

2016-03-22 21:40:33

by Jens Axboe

Subject: Re: [PATCH 2/6] writeback: wb_start_writeback() should use WB_SYNC_ALL for WB_REASON_SYNC

On 03/22/2016 03:34 PM, Dave Chinner wrote:
> On Tue, Mar 22, 2016 at 11:55:16AM -0600, Jens Axboe wrote:
>> If you call sync, the initial call to wakeup_flusher_threads() ends up
>> calling wb_start_writeback() with reason=WB_REASON_SYNC, but
>> wb_start_writeback() always uses WB_SYNC_NONE as the writeback mode.
>> Ensure that we use WB_SYNC_ALL for a sync operation.
>
> This seems wrong to me. We want background write to happen as
> quickly as possible and /not block/ when we first kick sync.

It's not going to block. wakeup_flusher_threads() async queues writeback
work through wb_start_writeback().

> The latter blocking passes of sync use WB_SYNC_ALL to ensure that we
> block waiting for all remaining IO to be issued and waited on, but
> the background writeback doesn't need to do this.

That's fine, they can get to wait on the previously issued IO, which was
properly submitted with WB_SYNC_ALL.

Maybe I'm missing your point?

--
Jens Axboe

2016-03-22 21:51:10

by Jens Axboe

Subject: Re: [PATCH 2/6] writeback: wb_start_writeback() should use WB_SYNC_ALL for WB_REASON_SYNC

On 03/22/2016 03:40 PM, Jens Axboe wrote:
> On 03/22/2016 03:34 PM, Dave Chinner wrote:
>> On Tue, Mar 22, 2016 at 11:55:16AM -0600, Jens Axboe wrote:
>>> If you call sync, the initial call to wakeup_flusher_threads() ends up
>>> calling wb_start_writeback() with reason=WB_REASON_SYNC, but
>>> wb_start_writeback() always uses WB_SYNC_NONE as the writeback mode.
>>> Ensure that we use WB_SYNC_ALL for a sync operation.
>>
>> This seems wrong to me. We want background write to happen as
>> quickly as possible and /not block/ when we first kick sync.
>
> It's not going to block. wakeup_flusher_threads() async queues writeback
> work through wb_start_writeback().

By 'block' here, you mean the async work ending up doing
wait_on_page_writeback() because we're doing WB_SYNC_ALL instead of
WB_SYNC_NONE?

And if so:

>> The latter blocking passes of sync use WB_SYNC_ALL to ensure that we
>> block waiting for all remaining IO to be issued and waited on, but
>> the background writeback doesn't need to do this.

why not have it do that?

--
Jens Axboe

2016-03-22 21:51:48

by Dave Chinner

Subject: Re: [PATCHSET][RFC] Make background writeback not suck

On Tue, Mar 22, 2016 at 11:55:14AM -0600, Jens Axboe wrote:
> This patchset isn't as much a final solution as it's a demonstration
> of what I believe is a huge issue. Since the dawn of time, our
> background buffered writeback has sucked. When we do background
> buffered writeback, it should have little impact on foreground
> activity. That's the definition of background activity... But for as
> long as I can remember, heavy buffered writers have not behaved like
> that.

Of course not. The IO scheduler is supposed to determine how we
meter out bulk vs latency sensitive IO that is queued. That's what
all the things like anticipatory scheduling for read requests was
supposed to address....

I'm guessing you're seeing problems like this because blk-mq has no
IO scheduler infrastructure and so no way of prioritising,
scheduling and/or throttling different types of IO? Would that be
accurate?

> For instance, if I do something like this:
>
> $ dd if=/dev/zero of=foo bs=1M count=10k
>
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or for server oriented workloads
> where installation of a big RPM (or similar) adversely impacts data
> base reads. When that happens, I get people yelling at me.
>
> A quick demonstration - a fio job that reads a file, while someone
> else issues the above 'dd'. Run on a flash device, using XFS. The
> vmstat output looks something like this:
>
> --io---- -system-- ------cpu-----
> bi bo in cs us sy id wa st
> 156 4648 58 151 0 1 98 1 0
> 0 0 64 83 0 0 100 0 0
> 0 32 76 119 0 0 100 0 0
> 26616 0 7574 13907 7 0 91 2 0
> 41992 0 10811 21395 0 2 95 3 0
> 46040 0 11836 23395 0 3 94 3 0
> 19376 1310736 5894 10080 0 4 93 3 0
> 116 1974296 1858 455 0 4 93 3 0
> 124 2020372 1964 545 0 4 92 4 0
> 112 1678356 1955 620 0 3 93 3 0
> 8560 405508 3759 4756 0 1 96 3 0
> 42496 0 10798 21566 0 0 97 3 0
> 42476 0 10788 21524 0 0 97 3 0

So writeback is running at about 2GB/s, meaning the memory is
cleaned in about 5s.

> The read starts out fine, but goes to shit when we start background
> flushing. The reader experiences latency spikes in the seconds range.
> On flash.
>
> With this set of patches applied, the situation looks like this instead:
>
> --io---- -system-- ------cpu-----
> bi bo in cs us sy id wa st
> 33544 0 8650 17204 0 1 97 2 0
> 42488 0 10856 21756 0 0 97 3 0
> 42032 0 10719 21384 0 0 97 3 0
> 42544 12 10838 21631 0 0 97 3 0
> 42620 0 10982 21727 0 3 95 3 0
> 46392 0 11923 23597 0 3 94 3 0
> 36268 512000 9907 20044 0 3 91 5 0
> 31572 696324 8840 18248 0 1 91 7 0
> 30748 626692 8617 17636 0 2 91 6 0
> 31016 618504 8679 17736 0 3 91 6 0
> 30612 648196 8625 17624 0 3 91 6 0
> 30992 650296 8738 17859 0 3 91 6 0
> 30680 604075 8614 17605 0 3 92 6 0
> 30592 595040 8572 17564 0 2 92 6 0
> 31836 539656 8819 17962 0 2 92 5 0

And now it runs at ~600MB/s, slowing down the rate at which memory
is cleaned by 60%.

Given that background writeback is relied on by memory reclaim to
clean memory faster than the LRUs are cycled, I suspect this is
going to have a big impact on low memory behaviour and balance,
which will then feed into IO breakdown problems caused by writeback
being driven from the LRUs rather than the flusher threads.....

> The above was the why. The how is basically throttling background
> writeback. We still want to issue big writes from the vm side of things,
> so we get nice and big extents on the file system end. But we don't need
> to flood the device with THOUSANDS of requests for background writeback.
> For most devices, we don't need a whole lot to get decent throughput.

Except, when the system is busy (e.g. CPU busy) and the writeback
threads can be starved of CPU by other operations, the writeback
queue depth needs to go way up so that we don't end up with idle
devices because the flusher threads are starved of CPU....

> This adds some simple blk-wb code that limits how much buffered
> writeback we keep in flight on the device end. The default is pretty
> low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
> dirtying task ends up being throttled in balance_dirty_pages(), we up
> the limit. Currently there are tunables associated with this, see the
> last patch for descriptions of those.
>
> I welcome testing. The end goal here would be having much of this
> auto-tuned, so that we don't lose substantial bandwidth for
> background writes, while still maintaining decent non-wb
> performance and latencies.

Right, another layer of "writeback tunables" is not really a
desirable outcome. We spent a lot of time making the dirty page
cache flushing not need tunables (i.e. via careful design of closed
loop feedback systems), so I think that if we're going to add a new
layer of throttling, we need to do the same thing. i.e. it needs to
adapt automatically and correctly to changing loads and workloads.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-22 22:03:58

by Jens Axboe

Subject: Re: [PATCHSET][RFC] Make background writeback not suck

On 03/22/2016 03:51 PM, Dave Chinner wrote:
> On Tue, Mar 22, 2016 at 11:55:14AM -0600, Jens Axboe wrote:
>> This patchset isn't as much a final solution as it's a demonstration
>> of what I believe is a huge issue. Since the dawn of time, our
>> background buffered writeback has sucked. When we do background
>> buffered writeback, it should have little impact on foreground
>> activity. That's the definition of background activity... But for as
>> long as I can remember, heavy buffered writers have not behaved like
>> that.
>
> Of course not. The IO scheduler is supposed to determine how we
> meter out bulk vs latency sensitive IO that is queued. That's what
> all the things like anticipatory scheduling for read requests was
> supposed to address....
>
> I'm guessing you're seeing problems like this because blk-mq has no
> IO scheduler infrastructure and so no way of prioritising,
> scheduling and/or throttling different types of IO? Would that be
> accurate?

It's not just that, but obviously the IO scheduler would be one place to
throttle it. This is, in a way, a means of scheduling the writeback writes
better. But most of the reports I get on writeback sucking are not using
scsi/blk-mq; they end up being "classic" setups on things like deadline.

>> For instance, if I do something like this:
>>
>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>
>> on my laptop, and then try and start chrome, it basically won't start
>> before the buffered writeback is done. Or for server oriented workloads
>> where installation of a big RPM (or similar) adversely impacts data
>> base reads. When that happens, I get people yelling at me.
>>
>> A quick demonstration - a fio job that reads a file, while someone
>> else issues the above 'dd'. Run on a flash device, using XFS. The
>> vmstat output looks something like this:
>>
>> --io---- -system-- ------cpu-----
>> bi bo in cs us sy id wa st
>> 156 4648 58 151 0 1 98 1 0
>> 0 0 64 83 0 0 100 0 0
>> 0 32 76 119 0 0 100 0 0
>> 26616 0 7574 13907 7 0 91 2 0
>> 41992 0 10811 21395 0 2 95 3 0
>> 46040 0 11836 23395 0 3 94 3 0
>> 19376 1310736 5894 10080 0 4 93 3 0
>> 116 1974296 1858 455 0 4 93 3 0
>> 124 2020372 1964 545 0 4 92 4 0
>> 112 1678356 1955 620 0 3 93 3 0
>> 8560 405508 3759 4756 0 1 96 3 0
>> 42496 0 10798 21566 0 0 97 3 0
>> 42476 0 10788 21524 0 0 97 3 0
>
> So writeback is running at about 2GB/s, meaning the memory is
> cleaned in about 5s.

Correct, and at the same time destroying anything else that runs on the
disk. For most use cases, not ideal. If we get in a tighter spot on
memory or someone waits on it, yes, we should ramp up. But not for
background cleaning.

>> The read starts out fine, but goes to shit when we start background
>> flushing. The reader experiences latency spikes in the seconds range.
>> On flash.
>>
>> With this set of patches applied, the situation looks like this instead:
>>
>> --io---- -system-- ------cpu-----
>> bi bo in cs us sy id wa st
>> 33544 0 8650 17204 0 1 97 2 0
>> 42488 0 10856 21756 0 0 97 3 0
>> 42032 0 10719 21384 0 0 97 3 0
>> 42544 12 10838 21631 0 0 97 3 0
>> 42620 0 10982 21727 0 3 95 3 0
>> 46392 0 11923 23597 0 3 94 3 0
>> 36268 512000 9907 20044 0 3 91 5 0
>> 31572 696324 8840 18248 0 1 91 7 0
>> 30748 626692 8617 17636 0 2 91 6 0
>> 31016 618504 8679 17736 0 3 91 6 0
>> 30612 648196 8625 17624 0 3 91 6 0
>> 30992 650296 8738 17859 0 3 91 6 0
>> 30680 604075 8614 17605 0 3 92 6 0
>> 30592 595040 8572 17564 0 2 92 6 0
>> 31836 539656 8819 17962 0 2 92 5 0
>
> And now it runs at ~600MB/s, slowing down the rate at which memory
> is cleaned by 60%.

Which is the point, correct... If we're not anywhere near being tight on
memory AND nobody is waiting for this IO, then by definition, the
foreground activity is the important one. For the case used here, that's
the application doing reads.

> Given that background writeback is relied on by memory reclaim to
> clean memory faster than the LRUs are cycled, I suspect this is
> going to have a big impact on low memory behaviour and balance,
> which will then feed into IO breakdown problems caused by writeback
> being driven from the LRUs rather than the flusher threads.....

You're missing the part where the intent is to only throttle it heavily
when it's pure background writeback. Of course, if we are low on memory
and doing reclaim, we should get much closer to device bandwidth.

If I run the above dd without the reader running, I'm already at 90% of
the device bandwidth - not quite all the way there, since I still want
to quickly be able to inject reads (or other IO) without having to wait
for the queues to purge thousands of requests.

>> The above was the why. The how is basically throttling background
>> writeback. We still want to issue big writes from the vm side of things,
>> so we get nice and big extents on the file system end. But we don't need
>> to flood the device with THOUSANDS of requests for background writeback.
>> For most devices, we don't need a whole lot to get decent throughput.
>
> Except, when the system is busy (e.g. CPU busy) and the writeback
> threads can be starved of CPU by other operations, the writeback
> queue depth needs to go way up so that we don't end up with idle
> devices because the flusher threads are starved of CPU....

Sure, writeback always needs to make stable progress.

>> This adds some simple blk-wb code that limits how much buffered
>> writeback we keep in flight on the device end. The default is pretty
>> low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
>> dirtying task ends up being throttled in balance_dirty_pages(), we up
>> the limit. Currently there are tunables associated with this, see the
>> last patch for descriptions of those.
>>
>> I welcome testing. The end goal here would be having much of this
>> auto-tuned, so that we don't lose substantial bandwidth for
>> background writes, while still maintaining decent non-wb
>> performance and latencies.
>
> Right, another layer of "writeback tunables" is not really a
> desirable outcome. We spent a lot of time making the dirty page
> cache flushing not need tunables (i.e. via careful design of closed
> loop feedback systems), so I think that if we're going to add a new
> layer of throttling, we need to do the same thing. i.e. it needs to
> adapt automatically and correctly to changing loads and workloads.

Fully agree, and that's what I stated as well. The current patchset is a
way to experiment with improving background writeback; that's spelled out
both in the very first paragraph of this email and in the blk-wb.c file.
I'm not a huge fan of tunables; nobody touches them, and we need to get
it right out of the box.

I've already removed one set of tunables from this posting compared to
what I had a week ago, it's moving in that direction.

--
Jens Axboe

2016-03-22 22:04:38

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/6] writeback: wb_start_writeback() should use WB_SYNC_ALL for WB_REASON_SYNC

On Tue, Mar 22, 2016 at 03:40:28PM -0600, Jens Axboe wrote:
> On 03/22/2016 03:34 PM, Dave Chinner wrote:
> >On Tue, Mar 22, 2016 at 11:55:16AM -0600, Jens Axboe wrote:
> >>If you call sync, the initial call to wakeup_flusher_threads() ends up
> >>calling wb_start_writeback() with reason=WB_REASON_SYNC, but
> >>wb_start_writeback() always uses WB_SYNC_NONE as the writeback mode.
> >>Ensure that we use WB_SYNC_ALL for a sync operation.
> >
> >This seems wrong to me. We want background write to happen as
> >quickly as possible and /not block/ when we first kick sync.
>
> It's not going to block. wakeup_flusher_threads() async queues
> writeback work through wb_start_writeback().

The flusher threads block, not the initial wakeup. e.g. they will
now block waiting for data writeback to complete before writing the
inode. i.e. this code in __writeback_single_inode() is now triggered
by the background flusher:

/*
* Make sure to wait on the data before writing out the metadata.
* This is important for filesystems that modify metadata on data
* I/O completion. We don't do it for sync(2) writeback because it has a
* separate, external IO completion path and ->sync_fs for guaranteeing
* inode metadata is written back correctly.
*/
if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
int err = filemap_fdatawait(mapping);
if (ret == 0)
ret = err;
}

It also changes the writeback chunk size in write_cache_pages(), so
instead of doing a bit of writeback from all dirty inodes, it tries
to write everything from each inode in turn (see
writeback_chunk_size()) which will further exacerbate the wait
above.
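
For reference, writeback_chunk_size() is roughly this - a paraphrase of
fs/fs-writeback.c from a kernel of this vintage, trimmed for brevity,
not a verbatim copy:

/*
 * WB_SYNC_ALL (and tagged_writepages) writeback gets an unbounded
 * chunk, so each inode is drained completely before moving on.
 * WB_SYNC_NONE writeback only gets a bandwidth-sized slice per inode,
 * which is what spreads background writeback across all dirty inodes.
 */
static long writeback_chunk_size(struct bdi_writeback *wb,
				 struct wb_writeback_work *work)
{
	long pages;

	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
		return LONG_MAX;

	pages = min(wb->avg_write_bandwidth / 2,
		    global_wb_domain.dirty_limit / DIRTY_SCOPE);
	pages = min(pages, work->nr_pages);
	return round_down(pages + MIN_WRITEBACK_PAGES, MIN_WRITEBACK_PAGES);
}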

> >The latter blocking passes of sync use WB_SYNC_ALL to ensure that we
> >block waiting for all remaining IO to be issued and waited on, but
> >the background writeback doesn't need to do this.
>
> That's fine, they can get to wait on the previously issued IO, which
> was properly submitted with WB_SYNC_ALL.
>
> Maybe I'm missing your point?

Making the background flusher block and wait for data makes it
completely ineffective in speeding up sync() processing...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-22 22:07:25

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 2/6] writeback: wb_start_writeback() should use WB_SYNC_ALL for WB_REASON_SYNC

On 03/22/2016 04:04 PM, Dave Chinner wrote:
> On Tue, Mar 22, 2016 at 03:40:28PM -0600, Jens Axboe wrote:
>> On 03/22/2016 03:34 PM, Dave Chinner wrote:
>>> On Tue, Mar 22, 2016 at 11:55:16AM -0600, Jens Axboe wrote:
>>>> If you call sync, the initial call to wakeup_flusher_threads() ends up
>>>> calling wb_start_writeback() with reason=WB_REASON_SYNC, but
>>>> wb_start_writeback() always uses WB_SYNC_NONE as the writeback mode.
>>>> Ensure that we use WB_SYNC_ALL for a sync operation.
>>>
>>> This seems wrong to me. We want background write to happen as
>>> quickly as possible and /not block/ when we first kick sync.
>>
>> It's not going to block. wakeup_flusher_threads() async queues
>> writeback work through wb_start_writeback().
>
> The flusher threads block, not the initial wakeup. e.g. they will
> now block waiting for data writeback to complete before writing the
> inode. i.e. this code in __writeback_single_inode() is now triggered
> by the background flusher:
>
> /*
> * Make sure to wait on the data before writing out the metadata.
> * This is important for filesystems that modify metadata on data
> * I/O completion. We don't do it for sync(2) writeback because it has a
> * separate, external IO completion path and ->sync_fs for guaranteeing
> * inode metadata is written back correctly.
> */
> if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
> int err = filemap_fdatawait(mapping);
> if (ret == 0)
> ret = err;
> }

Yeah, that's not ideal. For this case we'd really like something like a
WB_SYNC_ALL_NOWAIT...
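
Purely as a hypothetical sketch - no such mode exists in the tree - the
idea would be a third sync mode that gets WB_SYNC_ALL style tagging and
chunking but never blocks on data completion:

/*
 * Hypothetical only.  write_cache_pages()/writeback_chunk_size() would
 * treat WB_SYNC_ALL_NOWAIT like WB_SYNC_ALL (tag and write everything),
 * while the filemap_fdatawait() in __writeback_single_inode() quoted
 * above would be skipped; sync(2)'s later blocking pass still waits.
 */
enum writeback_sync_modes {
	WB_SYNC_NONE,		/* Don't wait on anything */
	WB_SYNC_ALL,		/* Wait on every mapping */
	WB_SYNC_ALL_NOWAIT,	/* hypothetical: write it all, wait on nothing */
};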

> It also changes the writeback chunk size in write_cache_pages(), so
> instead of doing a bit of writeback from all dirty inodes, it tries
> to write everything from each inode in turn (see
> writeback_chunk_size()) which will further exacerbate the wait
> above.

But that part is fine, if it wasn't for the waiting.

>>> The latter blocking passes of sync use WB_SYNC_ALL to ensure that we
>>> block waiting for all remaining IO to be issued and waited on, but
>>> the background writeback doesn't need to do this.
>>
>> That's fine, they can get to wait on the previously issued IO, which
>> was properly submitted with WB_SYNC_ALL.
>>
>> Maybe I'm missing your point?
>
> Making the background flusher block and wait for data makes it
> completely ineffective in speeding up sync() processing...

Agree, we should not wait on the pages individually; we want them
submitted and then waited on. And I suppose it's no different from
handling a normal buffered write from an application, which then gets
waited on with fsync() or similar. So I can drop this patch, it'll work
fine without it.

--
Jens Axboe

2016-03-22 22:32:40

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCHSET][RFC] Make background writeback not suck

On Tue, Mar 22, 2016 at 04:03:28PM -0600, Jens Axboe wrote:
> On 03/22/2016 03:51 PM, Dave Chinner wrote:
> >On Tue, Mar 22, 2016 at 11:55:14AM -0600, Jens Axboe wrote:
> >>This patchset isn't as much a final solution, as it's demonstration
> >>of what I believe is a huge issue. Since the dawn of time, our
> >>background buffered writeback has sucked. When we do background
> >>buffered writeback, it should have little impact on foreground
> >>activity. That's the definition of background activity... But for as
> >>long as I can remember, heavy buffered writers have not behaved like
> >>that.
> >
> >Of course not. The IO scheduler is supposed to determine how we
> >meter out bulk vs latency sensitive IO that is queued. That's what
> >all the things like anticipatory scheduling for read requests was
> >supposed to address....
> >
> >I'm guessing you're seeing problems like this because blk-mq has no
> >IO scheduler infrastructure and so no way of prioritising,
> >scheduling and/or throttling different types of IO? Would that be
> >accurate?
>
> It's not just that, but obviously the IO scheduler would be one
> place to throttle it. This is, in a way, a form of scheduling the
> writeback writes better. But most of the reports I get on writeback
> sucking are not using scsi/blk-mq; they end up being "classic" on
> things like deadline.

Deadline doesn't have anticipatory read scheduling, right?

Really, I'm just trying to understand why this isn't being added as
part of the IO scheduler infrastructure, but is instead adding
another layer of non-optional IO scheduling to the block layer...

> >>The read starts out fine, but goes to shit when we start background
> >>flushing. The reader experiences latency spikes in the seconds range.
> >>On flash.
> >>
> >>With this set of patches applied, the situation looks like this instead:
> >>
> >>--io---- -system-- ------cpu-----
> >>bi bo in cs us sy id wa st
> >> 33544 0 8650 17204 0 1 97 2 0
> >> 42488 0 10856 21756 0 0 97 3 0
> >> 42032 0 10719 21384 0 0 97 3 0
> >> 42544 12 10838 21631 0 0 97 3 0
> >> 42620 0 10982 21727 0 3 95 3 0
> >> 46392 0 11923 23597 0 3 94 3 0
> >> 36268 512000 9907 20044 0 3 91 5 0
> >> 31572 696324 8840 18248 0 1 91 7 0
> >> 30748 626692 8617 17636 0 2 91 6 0
> >> 31016 618504 8679 17736 0 3 91 6 0
> >> 30612 648196 8625 17624 0 3 91 6 0
> >> 30992 650296 8738 17859 0 3 91 6 0
> >> 30680 604075 8614 17605 0 3 92 6 0
> >> 30592 595040 8572 17564 0 2 92 6 0
> >> 31836 539656 8819 17962 0 2 92 5 0
> >
> >And now it runs at ~600MB/s, slowing down the rate at which memory
> >is cleaned by 60%.
>
> Which is the point, correct... If we're not anywhere near being
> tight on memory AND nobody is waiting for this IO, then by
> definition, the foreground activity is the important one. For the
> case used here, that's the application doing reads.

Unless, of course, we are in a situation where there is also large
memory demand, and we need to clean memory fast....

> >Given that background writeback is relied on by memory reclaim to
> >clean memory faster than the LRUs are cycled, I suspect this is
> >going to have a big impact on low memory behaviour and balance,
> >which will then feed into IO breakdown problems caused by writeback
> >being driven from the LRUs rather than the flusher threads.....
>
> You're missing the part where the intent is to only throttle it
> heavily when it's pure background writeback. Of course, if we are
> low on memory and doing reclaim, we should get much closer to device
> bandwidth.

A demonstration, please.

I didn't see anything in the code that treated low memory conditions
differently - that just uses
wakeup_flusher_threads(WB_REASON_TRY_TO_FREE_PAGES) from
do_try_to_free_pages() to trigger background writeback to run and
clean pages, so I'm interested to see exactly how that works out...

> If I run the above dd without the reader running, I'm already at 90%
> of the device bandwidth - not quite all the way there, since I still
> want to quickly be able to inject reads (or other IO) without having
> to wait for the queues to purge thousands of requests.

So, essentially, the model is to run background write at "near
starvation" queue depths, which works fine when the system is mostly
idle and we can dispatch more IO immediately. My concern with this
model is that under heavy IO and CPU load, writeback dispatch often
has significant delays (e.g. for allocation, etc). This is when we
need deeper queue depths to maintain throughput across dispatch
latency variations.

Many production workloads don't care about read latency, but do care
about bulk page cache throughput. Such workloads are going to be
adversely affected by a fundamental block layer IO dispatch model
change like this. This is why we have the pluggable IO schedulers in
the first place - one size does not fit all.

Hence I'm thinking that this should not be applied to all block
devices as this patch does, but instead be a part of the io
scheduling infrastructure we already have (and need for blk-mq).

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-03-22 22:57:16

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCHSET][RFC] Make background writeback not suck

On 03/22/2016 04:31 PM, Dave Chinner wrote:
> On Tue, Mar 22, 2016 at 04:03:28PM -0600, Jens Axboe wrote:
>> On 03/22/2016 03:51 PM, Dave Chinner wrote:
>>> On Tue, Mar 22, 2016 at 11:55:14AM -0600, Jens Axboe wrote:
>>>> This patchset isn't as much a final solution, as it's demonstration
>>>> of what I believe is a huge issue. Since the dawn of time, our
>>>> background buffered writeback has sucked. When we do background
>>>> buffered writeback, it should have little impact on foreground
>>>> activity. That's the definition of background activity... But for as
> >>> long as I can remember, heavy buffered writers have not behaved like
>>>> that.
>>>
>>> Of course not. The IO scheduler is supposed to determine how we
>>> meter out bulk vs latency sensitive IO that is queued. That's what
>>> all the things like anticipatory scheduling for read requests was
>>> supposed to address....
>>>
>>> I'm guessing you're seeing problems like this because blk-mq has no
>>> IO scheduler infrastructure and so no way of prioritising,
>>> scheduling and/or throttling different types of IO? Would that be
>>> accurate?
>>
>> It's not just that, but obviously the IO scheduler would be one
>> place to throttle it. This is, in a way, a form of scheduling the
>> writeback writes better. But most of the reports I get on writeback
>> sucking are not using scsi/blk-mq; they end up being "classic" on
>> things like deadline.
>
> Deadline doesn't have anticipatory read scheduling, right?
>
> Really, I'm just trying to understand why this isn't being added as
> part of the IO scheduler infrastructure, but is instead adding
> another layer of non-optional IO scheduling to the block layer...

This is less about IO scheduling than it is about making sure that
background writeback isn't too intrusive. Yes, you can argue that that
is a form of IO scheduling. But so is the rate detection, for instance,
and some of the other limits we put in. But we don't properly limit
depth, and I think we should.

Anticipatory read scheduling is a completely different animal; it's
centered around avoiding long seeks on rotational media, assuming that
there's locality on disk for related (and back-to-back) reads.

This can easily be part of the IO scheduling infrastructure; it already
somewhat is, given where it's placed.

> >>>> The read starts out fine, but goes to shit when we start background
>>>> flushing. The reader experiences latency spikes in the seconds range.
>>>> On flash.
>>>>
> >>>> With this set of patches applied, the situation looks like this instead:
>>>>
>>>> --io---- -system-- ------cpu-----
>>>> bi bo in cs us sy id wa st
>>>> 33544 0 8650 17204 0 1 97 2 0
>>>> 42488 0 10856 21756 0 0 97 3 0
>>>> 42032 0 10719 21384 0 0 97 3 0
>>>> 42544 12 10838 21631 0 0 97 3 0
>>>> 42620 0 10982 21727 0 3 95 3 0
>>>> 46392 0 11923 23597 0 3 94 3 0
>>>> 36268 512000 9907 20044 0 3 91 5 0
>>>> 31572 696324 8840 18248 0 1 91 7 0
>>>> 30748 626692 8617 17636 0 2 91 6 0
>>>> 31016 618504 8679 17736 0 3 91 6 0
>>>> 30612 648196 8625 17624 0 3 91 6 0
>>>> 30992 650296 8738 17859 0 3 91 6 0
>>>> 30680 604075 8614 17605 0 3 92 6 0
>>>> 30592 595040 8572 17564 0 2 92 6 0
>>>> 31836 539656 8819 17962 0 2 92 5 0
>>>
>>> And now it runs at ~600MB/s, slowing down the rate at which memory
>>> is cleaned by 60%.
>>
>> Which is the point, correct... If we're not anywhere near being
>> tight on memory AND nobody is waiting for this IO, then by
>> definition, the foreground activity is the important one. For the
>> case used here, that's the application doing reads.
>
> Unless, of course, we are in a situation where there is also large
> memory demand, and we need to clean memory fast....

If that's the case, then we should ramp up limits. The important part
here is that we currently issue thousands of requests. Literally
thousands. Do we need thousands to get optimal throughput? No. In fact
it's detrimental to other system activity, without providing any
benefits. The idea here is to throttle us within a given window of
depth, let's call that 1..N, where N is enough to get us full write
performance. If we're doing pure background writes, QD stays well below
N. If the urgency increases, we'll ramp it up. But in all cases, we're
avoiding the situation of having several thousand requests in flight.
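
To make the 1..N window concrete, a toy sketch - NOT the actual blk-wb
code from this series; the names and ratios are invented:

/*
 * Background writeback gets a small fraction of the allowed window N,
 * a dirtier throttled in balance_dirty_pages() gets more, and
 * sync/reclaim gets the full depth so we can still reach device
 * bandwidth when it matters.
 */
enum wb_urgency {
	WB_BACKGROUND,		/* periodic/background flushing */
	WB_DIRTY_THROTTLED,	/* dirtier blocked in balance_dirty_pages() */
	WB_SYNC_OR_RECLAIM,	/* sync(2) or memory reclaim needs pages clean */
};

static unsigned int wb_depth_limit(unsigned int max_depth,
				   enum wb_urgency urgency)
{
	unsigned int depth;

	switch (urgency) {
	case WB_BACKGROUND:
		depth = max_depth / 8;		/* a handful is usually plenty */
		break;
	case WB_DIRTY_THROTTLED:
		depth = max_depth / 2;		/* someone is waiting, ramp up */
		break;
	case WB_SYNC_OR_RECLAIM:
	default:
		depth = max_depth;		/* full window, full bandwidth */
		break;
	}
	return depth ? depth : 1;		/* never starve writeback entirely */
}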

>>> Given that background writeback is relied on by memory reclaim to
>>> clean memory faster than the LRUs are cycled, I suspect this is
>>> going to have a big impact on low memory behaviour and balance,
>>> which will then feed into IO breakdown problems caused by writeback
>>> being driven from the LRUs rather than the flusher threads.....
>>
>> You're missing the part where the intent is to only throttle it
>> heavily when it's pure background writeback. Of course, if we are
>> low on memory and doing reclaim, we should get much closer to device
>> bandwidth.
>
> A demonstration, please.
>
> I didn't see anything in the code that treated low memory conditions
> differently - that just uses
> wakeup_flusher_threads(WB_REASON_TRY_TO_FREE_PAGES) from
> do_try_to_free_pages() to trigger background writeback to run and
> clean pages, so I'm interested to see exactly how that works out...

It's not all there yet, this is an early RFC. The case that is handled
is the application being blocked on dirtying memory; we increase
bandwidth for that case. I'll look into reclaim next; honestly, I don't
see this as something that's hard to handle.

>> If I run the above dd without the reader running, I'm already at 90%
>> of the device bandwidth - not quite all the way there, since I still
>> want to quickly be able to inject reads (or other IO) without having
>> to wait for the queues to purge thousands of requests.
>
> So, essentially, the model is to run background write at "near
> starvation" queue depths, which works fine when the system is mostly
> idle and we can dispatch more IO immediately. My concern with this
> model is that under heavy IO and CPU load, writeback dispatch often
> has significant delays (e.g. for allocation, etc). This is when we
> need deeper queue depths to maintain throughput across dispatch
> latency variations.

Not near starvation. For most devices, it doesn't take a lot of writes
to get very close to max performance. I don't want to starve the device,
that's not the intention here. And if writeback kworkers being CPU
starved is a concern, then yes, of course that needs to be handled
appropriately.

If we're in or near a critical condition, then writeback must proceed
swiftly. For the general case where that's NOT true, we can limit
writeback a bit and get much better system behavior for most
applications and users.

> Many production workloads don't care about read latency, but do care
> about bulk page cache throughput. Such workloads are going to be
> adversely affected by a fundamental block layer IO dispatch model
> change like this. This is why we have the pluggable IO schedulers in
> the first place - one size does not fit all.

I'd counter that most production workloads DO care about IO
latencies, be it reads or writes. In fact it's often the single most
important thing that people complain about. When provisioning hardware
or setups, we're often in the situation that we have to drive things
softer than we would otherwise want to, to get more predictable
behavior. The current writeback is the opposite of predictable, unless
you consider the fact that it predictably shits itself with basic
operations like buffered writes.

One example is a user deliberately dirtying at X MB/sec, where X is well
below what the device can handle. That user now has two choices:

1) Do nothing, and incur the wrath of periodic writeback that issues a
ton of requests, disturbing everything else.
2) Insert periodic sync points to avoid queuing up too much. The user
needs to be able to block to do that, so that means offloading to a
thread (a rough sketch of this follows below).

Neither of those is an appealing choice. Wouldn't it be great if the user
didn't have to do anything, and the system just behaved in a reasonable
manner? I think so.
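
For illustration, option 2 above ends up looking roughly like this from
the application side - a userspace sketch, assuming sync_file_range(2)
is available; the chunk size is arbitrary and the helper-thread plumbing
is omitted:

/*
 * The writer rate-limits itself by periodically starting writeback on,
 * and waiting for, what it has written so far, so the kernel never
 * queues up a huge backlog on its behalf.  In practice the sync calls
 * would be pushed to a helper thread so the producer doesn't block.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define SYNC_CHUNK	(64UL << 20)	/* arbitrary: sync every 64MB */

static void write_with_sync_points(int fd, const char *buf, size_t len)
{
	size_t written = 0, synced = 0;

	while (written < len) {
		ssize_t ret = write(fd, buf + written, len - written);

		if (ret <= 0)
			break;
		written += ret;

		if (written - synced >= SYNC_CHUNK) {
			/* kick writeback of the new range and wait for it */
			sync_file_range(fd, synced, written - synced,
					SYNC_FILE_RANGE_WAIT_BEFORE |
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);
			synced = written;
		}
	}
}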

> Hence I'm thinking that this should not be applied to all block
> devices as this patch does, but instead be a part of the io
> scheduling infrastructure we already have (and need for blk-mq).

Sure, that's a placement concern, and that's something that can be
worked on. I feel like we're arguing in circles here.

--
Jens Axboe

2016-03-25 02:08:56

by Mike Christie

[permalink] [raw]
Subject: Re: [PATCH 1/6] block: ensure we don't truncate top bits of the request command flags

On 03/22/2016 02:01 PM, Jens Axboe wrote:
> On 03/22/2016 12:59 PM, Christoph Hellwig wrote:
>> On Tue, Mar 22, 2016 at 11:55:15AM -0600, Jens Axboe wrote:
>>> Some of the flags that we want to use from the make_request_fn path
>>> are now larger than 32-bit, so change the functions involved to
>>> accept a u64 instead of an unsigned int.
>>
>> When did we start doing that? We really should merge Mike's split
>> of the operation style flags into the cmd_type before making things
>> even worse in the flags area.
>
> Just now, and I ran into it last week as well, for a test patch on cfq
> that passed in higher flags for get_request -> may_queue() as well. We
> can do Mike's split first, I think it's a good cleanup. As a standalone
> series, I needed it though.
>

Hey, did you want any changes on that patchset? I was going to repost it
with the kbuild fix against linux-next, but I can make any changes you
wanted first.

2016-03-25 04:19:32

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 1/6] block: ensure we don't truncate top bits of the request command flags

On 03/24/2016 08:08 PM, Mike Christie wrote:
> On 03/22/2016 02:01 PM, Jens Axboe wrote:
>> On 03/22/2016 12:59 PM, Christoph Hellwig wrote:
>>> On Tue, Mar 22, 2016 at 11:55:15AM -0600, Jens Axboe wrote:
>>>> Some of the flags that we want to use from the make_request_fn path
>>>> are now larger than 32-bit, so change the functions involved to
>>>> accept a u64 instead of an unsigned int.
>>>
>>> When did we start doing that? We really should merge Mike's split
>>> of the operation style flags into the cmd_type before making things
>>> even worse in the flags area.
>>
>> Just now, and I ran into it last week as well, for a test patch on cfq
>> that passed in higher flags for get_request -> may_queue() as well. We
>> can do Mike's split first, I think it's a good cleanup. As a standalone
>> series, I needed it though.
>>
>
> Hey, did you want any changes on that patchset? I was going to repost it
> with the kbuild fix against linux-next, but I can make any changes you
> wanted first.

I don't believe I've ever been CC'ed on the posting, or that it was even
posted to the block list? If so, I don't see it... I did become aware of
it since Christoph CC'ed me in. In general, I think it looks good, at
least the end result. It's a bit murky in the middle, and the commit
messages need some help. So go over everything, sanitize it, and repost
it. I don't like the current pure flag-based scheme we have; it's a mess
of ops and modifiers. So splitting that up is definitely a good thing.
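
Hypothetically - this is not Mike's actual series, just the shape of the
split being talked about - the operation becomes its own small enum and
the flag word keeps only the modifiers:

/*
 * Hypothetical sketch of the ops/modifiers split under discussion, not
 * Mike's actual patches.  The operation is carried as a small enum of
 * its own, and the flag word is left holding only modifiers, so it no
 * longer has to grow past 32 bits just to say what kind of command
 * this is.
 */
enum req_op {
	REQ_OP_READ,
	REQ_OP_WRITE,
	REQ_OP_DISCARD,
	REQ_OP_FLUSH,
};

/* illustrative container, not a real kernel struct */
struct cmd_descr {
	enum req_op op;		/* what the command does */
	unsigned int flags;	/* how it does it: sync, meta, fua, ... */
};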

--
Jens Axboe