This patchset isn't as much a final solution as it is a demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:
$ dd if=/dev/zero of=foo bs=1M count=10k
on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server-oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.
A quick demonstration - a fio job that reads a file, while someone
else issues the above 'dd'. Run on a flash device, using XFS. The
vmstat output looks something like this:
--io---- -system-- ------cpu-----
bi bo in cs us sy id wa st
156 4648 58 151 0 1 98 1 0
0 0 64 83 0 0 100 0 0
0 32 76 119 0 0 100 0 0
26616 0 7574 13907 7 0 91 2 0
41992 0 10811 21395 0 2 95 3 0
46040 0 11836 23395 0 3 94 3 0
19376 1310736 5894 10080 0 4 93 3 0
116 1974296 1858 455 0 4 93 3 0
124 2020372 1964 545 0 4 92 4 0
112 1678356 1955 620 0 3 93 3 0
8560 405508 3759 4756 0 1 96 3 0
42496 0 10798 21566 0 0 97 3 0
42476 0 10788 21524 0 0 97 3 0
The read starts out fine, but goes to shit when we start background
flushing. The reader experiences latency spikes in the seconds range.
On flash.
With this set of patches applied, the situation looks like this instead:
--io---- -system-- ------cpu-----
bi bo in cs us sy id wa st
33544 0 8650 17204 0 1 97 2 0
42488 0 10856 21756 0 0 97 3 0
42032 0 10719 21384 0 0 97 3 0
42544 12 10838 21631 0 0 97 3 0
42620 0 10982 21727 0 3 95 3 0
46392 0 11923 23597 0 3 94 3 0
36268 512000 9907 20044 0 3 91 5 0
31572 696324 8840 18248 0 1 91 7 0
30748 626692 8617 17636 0 2 91 6 0
31016 618504 8679 17736 0 3 91 6 0
30612 648196 8625 17624 0 3 91 6 0
30992 650296 8738 17859 0 3 91 6 0
30680 604075 8614 17605 0 3 92 6 0
30592 595040 8572 17564 0 2 92 6 0
31836 539656 8819 17962 0 2 92 5 0
and the reader never sees latency spikes above a few milliseconds.
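For reference, the reader side is just a buffered fio read job. Something
along these lines, run while the above 'dd' is going, is enough to show the
behavior (the exact fio parameters here are illustrative, not the precise
job that was used):
$ fio --name=reader --filename=readfile --rw=read --bs=4k --size=10g --ioengine=psync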
The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.
This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases
that need to clean memory at or near device speed get to do that, and
we still don't need thousands of requests to accomplish it.
And for the cases where we don't need to be near device limits, we
can clean at a more reasonable pace. Currently there are two tunables
associated with this; see the last patch for descriptions of those.
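For example, something like this (device name and values purely illustrative)
bumps the background depth and shortens the write cache delay:
$ echo 16 > /sys/block/nvme0n1/queue/wb_depth
$ echo 5000 > /sys/block/nvme0n1/queue/wb_cache_usecs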
I welcome testing. The end goal here would be having much of this
auto-tuned, so that we don't lose substantial bandwidth for background
writes, while still maintaining decent non-wb performance and latencies.
The patchset should be fully stable; I have not observed any problems. It
passes full xfstest runs, and a variety of benchmarks as well. It
should work equally well on blk-mq/scsi-mq, and "classic" setups.
You can also find this in a branch in the block git repo:
git://git.kernel.dk/linux-block.git wb-buf-throttle-v2
Patches are against current Linus' git, 4.5.0+.
Changes since v1
- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in
the writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups
block/Makefile | 2
block/blk-core.c | 15 ++
block/blk-mq.c | 32 +++++
block/blk-settings.c | 11 +
block/blk-sysfs.c | 123 +++++++++++++++++++++
block/blk-wb.c | 219 +++++++++++++++++++++++++++++++++++++++
block/blk-wb.h | 27 ++++
drivers/nvme/host/core.c | 1
drivers/scsi/sd.c | 5
fs/block_dev.c | 2
fs/buffer.c | 2
fs/f2fs/data.c | 2
fs/f2fs/node.c | 2
fs/fs-writeback.c | 17 +++
fs/gfs2/meta_io.c | 3
fs/mpage.c | 9 -
fs/xfs/xfs_aops.c | 2
include/linux/backing-dev-defs.h | 2
include/linux/blk_types.h | 2
include/linux/blkdev.h | 7 +
include/linux/writeback.h | 8 +
mm/page-writeback.c | 2
22 files changed, 479 insertions(+), 16 deletions(-)
--
Jens Axboe
If we're doing reclaim or sync IO, use WRITE_SYNC to inform the lower
levels of the importance of this IO.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/writeback.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 719c255e105a..b2c75b8901da 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -102,7 +102,7 @@ struct writeback_control {
static inline int wbc_to_write(struct writeback_control *wbc)
{
- if (wbc->sync_mode == WB_SYNC_ALL)
+ if (wbc->sync_mode == WB_SYNC_ALL || wbc->for_reclaim || wbc->for_sync)
return WRITE_SYNC;
return WRITE;
--
2.4.1.168.g1ea28e1
Test patch that throttles buffered writeback to make it a lot
smoother, so that it has way less impact on other system activity.
Background writeback should be, by definition, background
activity. The fact that we flush huge bundles of it at a time
means that it potentially has heavy impacts on foreground workloads,
which isn't ideal. We can't easily limit the sizes of writes that
we do, since that would impact file system layout in the presence
of delayed allocation. So just throttle back buffered writeback,
unless someone is waiting for it.
This would likely need dynamic adaptation to the current device; this
one has only been tested on NVMe. But it brings down the background
activity impact from 1-2s to tens of milliseconds.
This is just a test patch, and as such, it registers queue sysfs
entries for monitoring and tweaking the current state:
$ cat /sys/block/nvme0n1/queue/wb_stats
limit=4, batch=2, inflight=0, wait=0, timer=0
'limit' denotes how many requests we will allow inflight for buffered
writeback; this setting can be tweaked by writing to the 'wb_depth'
file, and writing '0' turns it off completely. 'inflight' shows how
many requests are currently inflight for buffered writeback, 'wait'
shows if anyone is currently waiting for access, and 'timer' shows
if processes are currently being deferred by the write back cache
delay timer.
Background buffered writeback will be throttled at depth 'wb_depth',
and even lower (QD=1) if the device recently completed "competing" IO.
If we are doing reclaim or otherwise sync buffered writeback, the limit
is increased 4x to achieve full device bandwidth.
Finally, if the device has write back caching, 'wb_cache_delay' is the
number of usecs to wait after a write completes before allowing more
background writeback.
Signed-off-by: Jens Axboe <[email protected]>
---
block/Makefile | 2 +-
block/blk-core.c | 15 ++++
block/blk-mq.c | 32 ++++++-
block/blk-sysfs.c | 84 ++++++++++++++++++
block/blk-wb.c | 219 ++++++++++++++++++++++++++++++++++++++++++++++
block/blk-wb.h | 27 ++++++
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 3 +
8 files changed, 381 insertions(+), 3 deletions(-)
create mode 100644 block/blk-wb.c
create mode 100644 block/blk-wb.h
diff --git a/block/Makefile b/block/Makefile
index 9eda2322b2d4..9df911a3b569 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -5,7 +5,7 @@
obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
- blk-lib.o blk-mq.o blk-mq-tag.o \
+ blk-lib.o blk-mq.o blk-mq-tag.o blk-wb.o \
blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
badblocks.o partitions/
diff --git a/block/blk-core.c b/block/blk-core.c
index 827f8badd143..887a9e64c6ef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,6 +39,7 @@
#include "blk.h"
#include "blk-mq.h"
+#include "blk-wb.h"
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
@@ -848,6 +849,9 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
goto fail;
+ if (blk_buffered_writeback_init(q))
+ goto fail;
+
INIT_WORK(&q->timeout_work, blk_timeout_work);
q->request_fn = rfn;
q->prep_rq_fn = NULL;
@@ -880,6 +884,7 @@ blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
fail:
blk_free_flush_queue(q->fq);
+ blk_buffered_writeback_exit(q);
return NULL;
}
EXPORT_SYMBOL(blk_init_allocated_queue);
@@ -1485,6 +1490,8 @@ void __blk_put_request(struct request_queue *q, struct request *req)
/* this is a bio leak */
WARN_ON(req->bio != NULL);
+ blk_buffered_writeback_done(q->rq_wb, req);
+
/*
* Request may not have originated from ll_rw_blk. if not,
* it didn't come out of our reserved rq pools
@@ -1714,6 +1721,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
struct request *req;
unsigned int request_count = 0;
+ bool wb_acct;
/*
* low level driver can indicate that it wants pages above a
@@ -1766,6 +1774,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
}
get_rq:
+ wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, q->queue_lock);
+
/*
* This sync check and mask will be re-done in init_request_from_bio(),
* but we need to set it earlier to expose the sync flag to the
@@ -1781,11 +1791,16 @@ get_rq:
*/
req = get_request(q, rw_flags, bio, GFP_NOIO);
if (IS_ERR(req)) {
+ if (wb_acct)
+ __blk_buffered_writeback_done(q->rq_wb);
bio->bi_error = PTR_ERR(req);
bio_endio(bio);
goto out_unlock;
}
+ if (wb_acct)
+ req->cmd_flags |= REQ_BUF_INFLIGHT;
+
/*
* After dropping the lock and possibly sleeping here, our request
* may now be mergeable after it had proven unmergeable (above).
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 050f7a13021b..55aace97fd35 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -29,6 +29,7 @@
#include "blk.h"
#include "blk-mq.h"
#include "blk-mq-tag.h"
+#include "blk-wb.h"
static DEFINE_MUTEX(all_q_mutex);
static LIST_HEAD(all_q_list);
@@ -274,6 +275,9 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
if (rq->cmd_flags & REQ_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);
+
+ blk_buffered_writeback_done(q->rq_wb, rq);
+
rq->cmd_flags = 0;
clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
@@ -1253,6 +1257,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
struct blk_plug *plug;
struct request *same_queue_rq = NULL;
blk_qc_t cookie;
+ bool wb_acct;
blk_queue_bounce(q, &bio);
@@ -1270,9 +1275,17 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
} else
request_count = blk_plug_queued_count(q);
+ wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, NULL);
+
rq = blk_mq_map_request(q, bio, &data);
- if (unlikely(!rq))
+ if (unlikely(!rq)) {
+ if (wb_acct)
+ __blk_buffered_writeback_done(q->rq_wb);
return BLK_QC_T_NONE;
+ }
+
+ if (wb_acct)
+ rq->cmd_flags |= REQ_BUF_INFLIGHT;
cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
@@ -1349,6 +1362,7 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
struct blk_map_ctx data;
struct request *rq;
blk_qc_t cookie;
+ bool wb_acct;
blk_queue_bounce(q, &bio);
@@ -1363,9 +1377,17 @@ static blk_qc_t blk_sq_make_request(struct request_queue *q, struct bio *bio)
blk_attempt_plug_merge(q, bio, &request_count, NULL))
return BLK_QC_T_NONE;
+ wb_acct = blk_buffered_writeback_wait(q->rq_wb, bio, NULL);
+
rq = blk_mq_map_request(q, bio, &data);
- if (unlikely(!rq))
+ if (unlikely(!rq)) {
+ if (wb_acct)
+ __blk_buffered_writeback_done(q->rq_wb);
return BLK_QC_T_NONE;
+ }
+
+ if (wb_acct)
+ rq->cmd_flags |= REQ_BUF_INFLIGHT;
cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);
@@ -2018,6 +2040,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
/* mark the queue as mq asap */
q->mq_ops = set->ops;
+ if (blk_buffered_writeback_init(q))
+ return ERR_PTR(-ENOMEM);
+
q->queue_ctx = alloc_percpu(struct blk_mq_ctx);
if (!q->queue_ctx)
return ERR_PTR(-ENOMEM);
@@ -2084,6 +2109,7 @@ err_map:
kfree(q->queue_hw_ctx);
err_percpu:
free_percpu(q->queue_ctx);
+ blk_buffered_writeback_exit(q);
return ERR_PTR(-ENOMEM);
}
EXPORT_SYMBOL(blk_mq_init_allocated_queue);
@@ -2096,6 +2122,8 @@ void blk_mq_free_queue(struct request_queue *q)
list_del_init(&q->all_q_node);
mutex_unlock(&all_q_mutex);
+ blk_buffered_writeback_exit(q);
+
blk_mq_del_queue_tag_set(q);
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 954e510452d7..9ac9be23e700 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -13,6 +13,7 @@
#include "blk.h"
#include "blk-mq.h"
+#include "blk-wb.h"
struct queue_sysfs_entry {
struct attribute attr;
@@ -347,6 +348,71 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
return ret;
}
+static ssize_t queue_wb_stats_show(struct request_queue *q, char *page)
+{
+ struct rq_wb *wb = q->rq_wb;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return sprintf(page, "limit=%d, batch=%d, inflight=%d, wait=%d, timer=%d\n",
+ wb->limit, wb->batch, atomic_read(&wb->inflight),
+ waitqueue_active(&wb->wait), timer_pending(&wb->timer));
+}
+
+static ssize_t queue_wb_depth_show(struct request_queue *q, char *page)
+{
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return queue_var_show(q->rq_wb->limit, page);
+}
+
+static ssize_t queue_wb_depth_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long var;
+ ssize_t ret;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ ret = queue_var_store(&var, page, count);
+ if (ret < 0)
+ return ret;
+ if (var != (unsigned int) var)
+ return -EINVAL;
+
+ blk_update_wb_limit(q->rq_wb, var);
+ return ret;
+}
+
+static ssize_t queue_wb_cache_delay_show(struct request_queue *q, char *page)
+{
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ return queue_var_show(q->rq_wb->cache_delay_usecs, page);
+}
+
+static ssize_t queue_wb_cache_delay_store(struct request_queue *q,
+ const char *page, size_t count)
+{
+ unsigned long var;
+ ssize_t ret;
+
+ if (!q->rq_wb)
+ return -EINVAL;
+
+ ret = queue_var_store(&var, page, count);
+ if (ret < 0)
+ return ret;
+
+ q->rq_wb->cache_delay_usecs = var;
+ q->rq_wb->cache_delay = usecs_to_jiffies(var);
+ return ret;
+}
+
static ssize_t queue_wc_show(struct request_queue *q, char *page)
{
if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
@@ -516,6 +582,21 @@ static struct queue_sysfs_entry queue_wc_entry = {
.store = queue_wc_store,
};
+static struct queue_sysfs_entry queue_wb_stats_entry = {
+ .attr = {.name = "wb_stats", .mode = S_IRUGO },
+ .show = queue_wb_stats_show,
+};
+static struct queue_sysfs_entry queue_wb_cache_delay_entry = {
+ .attr = {.name = "wb_cache_usecs", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wb_cache_delay_show,
+ .store = queue_wb_cache_delay_store,
+};
+static struct queue_sysfs_entry queue_wb_depth_entry = {
+ .attr = {.name = "wb_depth", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wb_depth_show,
+ .store = queue_wb_depth_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -542,6 +623,9 @@ static struct attribute *default_attrs[] = {
&queue_random_entry.attr,
&queue_poll_entry.attr,
&queue_wc_entry.attr,
+ &queue_wb_stats_entry.attr,
+ &queue_wb_cache_delay_entry.attr,
+ &queue_wb_depth_entry.attr,
NULL,
};
diff --git a/block/blk-wb.c b/block/blk-wb.c
new file mode 100644
index 000000000000..2aa3753a8e1e
--- /dev/null
+++ b/block/blk-wb.c
@@ -0,0 +1,219 @@
+/*
+ * buffered writeback throttling
+ *
+ * Copyright (C) 2016 Jens Axboe
+ *
+ * Things that need changing:
+ *
+ * - Auto-detection of most of this, no tunables. Cache type we can get,
+ * and most other settings we can tweak/gather based on time.
+ * - Better solution for rwb->bdp_wait?
+ * - Higher depth for WB_SYNC_ALL?
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/bio.h>
+#include <linux/blkdev.h>
+
+#include "blk.h"
+#include "blk-wb.h"
+
+void __blk_buffered_writeback_done(struct rq_wb *rwb)
+{
+ int inflight;
+
+ inflight = atomic_dec_return(&rwb->inflight);
+ if (inflight >= rwb->limit)
+ return;
+
+ /*
+ * If the device does caching, we can still flood it with IO
+ * even at a low depth. If caching is on, delay a bit before
+ * submitting the next, if we're still purely background
+ * activity.
+ */
+ if (test_bit(QUEUE_FLAG_WC, &rwb->q->queue_flags) && !*rwb->bdp_wait &&
+ time_before(jiffies, rwb->last_comp + rwb->cache_delay)) {
+ if (!timer_pending(&rwb->timer))
+ mod_timer(&rwb->timer, jiffies + rwb->cache_delay);
+ return;
+ }
+
+ if (waitqueue_active(&rwb->wait)) {
+ int diff = rwb->limit - inflight;
+
+ if (diff >= rwb->batch)
+ wake_up_nr(&rwb->wait, 1);
+ }
+}
+
+/*
+ * Called on completion of a request. Note that it's also called when
+ * a request is merged, since the merged request gets freed.
+ */
+void blk_buffered_writeback_done(struct rq_wb *rwb, struct request *rq)
+{
+ if (!(rq->cmd_flags & REQ_BUF_INFLIGHT)) {
+ const unsigned long cur = jiffies;
+
+ if (rwb->limit && cur != rwb->last_comp)
+ rwb->last_comp = cur;
+ } else
+ __blk_buffered_writeback_done(rwb);
+}
+
+/*
+ * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
+ * false if 'v' + 1 would be bigger than 'below'.
+ */
+static bool atomic_inc_below(atomic_t *v, int below)
+{
+ int cur = atomic_read(v);
+
+ for (;;) {
+ int old;
+
+ if (cur >= below)
+ return false;
+ old = atomic_cmpxchg(v, cur, cur + 1);
+ if (old == cur)
+ break;
+ cur = old;
+ }
+
+ return true;
+}
+
+/*
+ * Block if we will exceed our limit, or if we are currently waiting for
+ * the timer to kick off queuing again.
+ */
+static void __blk_buffered_writeback_wait(struct rq_wb *rwb, unsigned int limit,
+ spinlock_t *lock)
+{
+ DEFINE_WAIT(wait);
+
+ if (!timer_pending(&rwb->timer) &&
+ atomic_inc_below(&rwb->inflight, limit))
+ return;
+
+ do {
+ prepare_to_wait_exclusive(&rwb->wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ if (!timer_pending(&rwb->timer) &&
+ atomic_inc_below(&rwb->inflight, limit))
+ break;
+
+ if (lock)
+ spin_unlock_irq(lock);
+
+ io_schedule();
+
+ if (lock)
+ spin_lock_irq(lock);
+ } while (1);
+
+ finish_wait(&rwb->wait, &wait);
+}
+
+/*
+ * Returns true if the IO request should be accounted, false if not.
+ * May sleep, if we have exceeded the writeback limits. Caller can pass
+ * in an irq held spinlock, if it holds one when calling this function.
+ * If we do sleep, we'll release and re-grab it.
+ */
+bool blk_buffered_writeback_wait(struct rq_wb *rwb, struct bio *bio,
+ spinlock_t *lock)
+{
+ unsigned int limit;
+
+ /*
+ * If disabled, or not a WRITE (or a discard), do nothing
+ */
+ if (!rwb->limit || !(bio->bi_rw & REQ_WRITE) ||
+ (bio->bi_rw & REQ_DISCARD))
+ return false;
+
+ /*
+ * Don't throttle WRITE_ODIRECT
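+ * (REQ_SYNC set without REQ_NOIDLE is how O_DIRECT writes are tagged)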
+ */
+ if ((bio->bi_rw & (REQ_SYNC | REQ_NOIDLE)) == REQ_SYNC)
+ return false;
+
+ /*
+ * At this point we know it's a buffered write. If REQ_SYNC is
+ * set, then it's WB_SYNC_ALL writeback. Bump the limit 4x for
+ * those, since someone is (or will be) waiting on that.
+ */
+ limit = rwb->limit;
+ if (bio->bi_rw & REQ_SYNC)
+ limit <<= 2;
+ else if (limit != 1) {
+ /*
+ * If less than 100ms since we completed unrelated IO,
+ * limit us to a depth of 1 for background writeback.
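+ * Otherwise, if nobody is currently waiting in balance_dirty_pages(),
+ * cut the allowed depth in half for pure background writeback.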
+ */
+ if (time_before(jiffies, rwb->last_comp + HZ / 10))
+ limit = 1;
+ else if (!*rwb->bdp_wait)
+ limit >>= 1;
+ }
+
+ __blk_buffered_writeback_wait(rwb, limit, lock);
+ return true;
+}
+
+void blk_update_wb_limit(struct rq_wb *rwb, unsigned int limit)
+{
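+ /* wake a waiter once 'batch' slots are free: limit / 2, clamped to [1, 4] */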
+ rwb->limit = limit;
+ rwb->batch = rwb->limit / 2;
+ if (!rwb->batch && rwb->limit)
+ rwb->batch = 1;
+ else if (rwb->batch > 4)
+ rwb->batch = 4;
+
+ wake_up_all(&rwb->wait);
+}
+
+static void blk_buffered_writeback_timer(unsigned long data)
+{
+ struct rq_wb *rwb = (struct rq_wb *) data;
+
+ if (waitqueue_active(&rwb->wait))
+ wake_up_nr(&rwb->wait, 1);
+}
+
+#define DEF_WB_LIMIT 4
+#define DEF_WB_CACHE_DELAY 10000
+
+int blk_buffered_writeback_init(struct request_queue *q)
+{
+ struct rq_wb *rwb;
+
+ rwb = kzalloc(sizeof(*rwb), GFP_KERNEL);
+ if (!rwb)
+ return -ENOMEM;
+
+ atomic_set(&rwb->inflight, 0);
+ init_waitqueue_head(&rwb->wait);
+ rwb->last_comp = jiffies;
+ rwb->bdp_wait = &q->backing_dev_info.wb.dirty_sleeping;
+ setup_timer(&rwb->timer, blk_buffered_writeback_timer,
+ (unsigned long) rwb);
+ rwb->cache_delay_usecs = DEF_WB_CACHE_DELAY;
+ rwb->cache_delay = usecs_to_jiffies(rwb->cache_delay);
+ rwb->q = q;
+ blk_update_wb_limit(rwb, DEF_WB_LIMIT);
+ q->rq_wb = rwb;
+ return 0;
+}
+
+void blk_buffered_writeback_exit(struct request_queue *q)
+{
+ if (q->rq_wb)
+ del_timer_sync(&q->rq_wb->timer);
+
+ kfree(q->rq_wb);
+ q->rq_wb = NULL;
+}
diff --git a/block/blk-wb.h b/block/blk-wb.h
new file mode 100644
index 000000000000..f3b4cd139815
--- /dev/null
+++ b/block/blk-wb.h
@@ -0,0 +1,27 @@
+#ifndef BLK_WB_H
+#define BLK_WB_H
+
+#include <linux/atomic.h>
+#include <linux/wait.h>
+
+struct rq_wb {
+ unsigned int limit;
+ unsigned int batch;
+ unsigned int cache_delay;
+ unsigned int cache_delay_usecs;
+ unsigned long last_comp;
+ unsigned int *bdp_wait;
+ struct request_queue *q;
+ atomic_t inflight;
+ wait_queue_head_t wait;
+ struct timer_list timer;
+};
+
+void __blk_buffered_writeback_done(struct rq_wb *);
+void blk_buffered_writeback_done(struct rq_wb *, struct request *);
+bool blk_buffered_writeback_wait(struct rq_wb *, struct bio *, spinlock_t *);
+int blk_buffered_writeback_init(struct request_queue *);
+void blk_buffered_writeback_exit(struct request_queue *);
+void blk_update_wb_limit(struct rq_wb *, unsigned int);
+
+#endif
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86a38ea1823f..6f2a174b771c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -188,6 +188,7 @@ enum rq_flag_bits {
__REQ_PM, /* runtime pm request */
__REQ_HASHED, /* on IO scheduler merge hash */
__REQ_MQ_INFLIGHT, /* track inflight for MQ */
+ __REQ_BUF_INFLIGHT, /* track inflight for buffered */
__REQ_NR_BITS, /* stops here */
};
@@ -241,6 +242,7 @@ enum rq_flag_bits {
#define REQ_PM (1ULL << __REQ_PM)
#define REQ_HASHED (1ULL << __REQ_HASHED)
#define REQ_MQ_INFLIGHT (1ULL << __REQ_MQ_INFLIGHT)
+#define REQ_BUF_INFLIGHT (1ULL << __REQ_BUF_INFLIGHT)
typedef unsigned int blk_qc_t;
#define BLK_QC_T_NONE -1U
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 76e875159e52..8586685bf7b2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -37,6 +37,7 @@ struct bsg_job;
struct blkcg_gq;
struct blk_flush_queue;
struct pr_ops;
+struct rq_wb;
#define BLKDEV_MIN_RQ 4
#define BLKDEV_MAX_RQ 128 /* Default maximum */
@@ -290,6 +291,8 @@ struct request_queue {
int nr_rqs[2]; /* # allocated [a]sync rqs */
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
+ struct rq_wb *rq_wb;
+
/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
* is used, root blkg allocates from @q->root_rl and all other
--
2.4.1.168.g1ea28e1
This isn't quite correct, since the VWC merely states if a potential
write back cache is volatile or not. But for the purpose of write
absorption, it's good enough.
Signed-off-by: Jens Axboe <[email protected]>
---
drivers/nvme/host/core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 643f457131c2..05c8edfb7611 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -906,6 +906,7 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
if (ctrl->vwc & NVME_CTRL_VWC_PRESENT)
blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
blk_queue_virt_boundary(q, ctrl->page_size - 1);
+ blk_queue_write_cache(q, ctrl->vwc & NVME_CTRL_VWC_PRESENT);
}
/*
--
2.4.1.168.g1ea28e1
Signed-off-by: Jens Axboe <[email protected]>
---
drivers/scsi/sd.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5a5457ac9cdb..049f424fb4ad 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -192,6 +192,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
sdkp->WCE = wce;
sdkp->RCD = rcd;
sd_set_flush_flag(sdkp);
+ blk_queue_write_cache(sdp->request_queue, wce != 0);
return count;
}
@@ -2571,7 +2572,7 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
sdkp->DPOFUA ? "supports DPO and FUA"
: "doesn't support DPO or FUA");
- return;
+ goto done;
}
bad_sense:
@@ -2596,6 +2597,8 @@ defaults:
}
sdkp->RCD = 0;
sdkp->DPOFUA = 0;
+done:
+ blk_queue_write_cache(sdp->request_queue, sdkp->WCE != 0);
}
/*
--
2.4.1.168.g1ea28e1
Avoid losing context by propagating the various reasons why we
initiate writeback.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/fs-writeback.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5c46ed9f3e14..387610cf4f7f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -52,6 +52,7 @@ struct wb_writeback_work {
unsigned int range_cyclic:1;
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
+ unsigned int for_reclaim:1; /* for mem reclaim */
unsigned int auto_free:1; /* free on completion */
enum wb_reason reason; /* why was writeback initiated? */
@@ -942,6 +943,21 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
work->reason = reason;
work->auto_free = 1;
+ switch (reason) {
+ case WB_REASON_BACKGROUND:
+ case WB_REASON_PERIODIC:
+ work->for_background = 1;
+ break;
+ case WB_REASON_TRY_TO_FREE_PAGES:
+ case WB_REASON_FREE_MORE_MEM:
+ work->for_reclaim = 1;
+ break;
+ case WB_REASON_SYNC:
+ work->for_sync = 1;
+ break;
+ default:
+ break;
+ }
+
wb_queue_work(wb, work);
}
@@ -1443,6 +1459,7 @@ static long writeback_sb_inodes(struct super_block *sb,
.for_kupdate = work->for_kupdate,
.for_background = work->for_background,
.for_sync = work->for_sync,
+ .for_reclaim = work->for_reclaim,
.range_cyclic = work->range_cyclic,
.range_start = 0,
.range_end = LLONG_MAX,
--
2.4.1.168.g1ea28e1
Add an internal helper and flag for setting whether a queue has
write back caching, or write through (or none). Add a sysfs file
to show this as well, and make it changeable from user space.
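For example, with a device that advertises a write back cache (device name
here is just an example):
$ cat /sys/block/sda/queue/write_cache
write back
$ echo "write through" > /sys/block/sda/queue/write_cache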
Signed-off-by: Jens Axboe <[email protected]>
---
block/blk-settings.c | 11 +++++++++++
block/blk-sysfs.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 4 ++++
3 files changed, 54 insertions(+)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c7bb666aafd1..4dbd511a9889 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -846,6 +846,17 @@ void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
}
EXPORT_SYMBOL_GPL(blk_queue_flush_queueable);
+void blk_queue_write_cache(struct request_queue *q, bool enabled)
+{
+ spin_lock_irq(q->queue_lock);
+ if (enabled)
+ queue_flag_set(QUEUE_FLAG_WC, q);
+ else
+ queue_flag_clear(QUEUE_FLAG_WC, q);
+ spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL_GPL(blk_queue_write_cache);
+
static int __init blk_settings_init(void)
{
blk_max_low_pfn = max_low_pfn - 1;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index dd93763057ce..954e510452d7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -347,6 +347,38 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
return ret;
}
+static ssize_t queue_wc_show(struct request_queue *q, char *page)
+{
+ if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+ return sprintf(page, "write back\n");
+
+ return sprintf(page, "write through\n");
+}
+
+static ssize_t queue_wc_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ int set = -1;
+
+ if (!strncmp(page, "write back", 10))
+ set = 1;
+ else if (!strncmp(page, "write through", 13) ||
+ !strncmp(page, "none", 4))
+ set = 0;
+
+ if (set == -1)
+ return -EINVAL;
+
+ spin_lock_irq(q->queue_lock);
+ if (set)
+ queue_flag_set(QUEUE_FLAG_WC, q);
+ else
+ queue_flag_clear(QUEUE_FLAG_WC, q);
+ spin_unlock_irq(q->queue_lock);
+
+ return count;
+}
+
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -478,6 +510,12 @@ static struct queue_sysfs_entry queue_poll_entry = {
.store = queue_poll_store,
};
+static struct queue_sysfs_entry queue_wc_entry = {
+ .attr = {.name = "write_cache", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_wc_show,
+ .store = queue_wc_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -503,6 +541,7 @@ static struct attribute *default_attrs[] = {
&queue_iostats_entry.attr,
&queue_random_entry.attr,
&queue_poll_entry.attr,
+ &queue_wc_entry.attr,
NULL,
};
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7e5d7e018bea..76e875159e52 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -491,15 +491,18 @@ struct request_queue {
#define QUEUE_FLAG_INIT_DONE 20 /* queue is initialized */
#define QUEUE_FLAG_NO_SG_MERGE 21 /* don't attempt to merge SG segments*/
#define QUEUE_FLAG_POLL 22 /* IO polling enabled if set */
+#define QUEUE_FLAG_WC 23 /* Write back caching */
#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
+ (1 << QUEUE_FLAG_WC) | \
(1 << QUEUE_FLAG_ADD_RANDOM))
#define QUEUE_FLAG_MQ_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
(1 << QUEUE_FLAG_SAME_COMP) | \
+ (1 << QUEUE_FLAG_WC) | \
(1 << QUEUE_FLAG_POLL))
static inline void queue_lockdep_assert_held(struct request_queue *q)
@@ -1009,6 +1012,7 @@ extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
+extern void blk_queue_write_cache(struct request_queue *q, bool enabled);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatterlist *);
--
2.4.1.168.g1ea28e1
Note in the bdi_writeback structure if a task is currently being
limited in balance_dirty_pages(), waiting for writeback to
proceed.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/backing-dev-defs.h | 2 ++
mm/page-writeback.c | 2 ++
2 files changed, 4 insertions(+)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1b4d69f68c33..f702309216b4 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -116,6 +116,8 @@ struct bdi_writeback {
struct list_head work_list;
struct delayed_work dwork; /* work item used for writeback */
+ int dirty_sleeping; /* waiting on dirty limit exceeded */
+
struct list_head bdi_node; /* anchored at bdi->wb_list */
#ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..15e696bc5d14 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1746,7 +1746,9 @@ pause:
pause,
start_time);
__set_current_state(TASK_KILLABLE);
+ wb->dirty_sleeping = 1;
io_schedule_timeout(pause);
+ wb->dirty_sleeping = 0;
current->dirty_paused_when = now + pause;
current->nr_dirtied = 0;
--
2.4.1.168.g1ea28e1
Add wbc_to_write(), which returns the write type to use, based on a
struct writeback_control. No functional changes in this patch, but it
prepares us for factoring in other wbc fields when choosing the write type.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/block_dev.c | 2 +-
fs/buffer.c | 2 +-
fs/f2fs/data.c | 2 +-
fs/f2fs/node.c | 2 +-
fs/gfs2/meta_io.c | 3 +--
fs/mpage.c | 9 ++++-----
fs/xfs/xfs_aops.c | 2 +-
include/linux/writeback.h | 8 ++++++++
8 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3172c4e2f502..b11d4e08b9a7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -432,7 +432,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
struct page *page, struct writeback_control *wbc)
{
int result;
- int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;
+ int rw = wbc_to_write(wbc);
const struct block_device_operations *ops = bdev->bd_disk->fops;
if (!ops->rw_page || bdev_get_integrity(bdev))
diff --git a/fs/buffer.c b/fs/buffer.c
index 33be29675358..28273caaf2b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1697,7 +1697,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr_underway = 0;
- int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ int write_op = wbc_to_write(wbc);
head = create_page_buffers(page, inode,
(1 << BH_Dirty)|(1 << BH_Uptodate));
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index e5c762b37239..dca5d43c67a3 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1143,7 +1143,7 @@ static int f2fs_write_data_page(struct page *page,
struct f2fs_io_info fio = {
.sbi = sbi,
.type = DATA,
- .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+ .rw = wbc_to_write(wbc),
.page = page,
.encrypted_page = NULL,
};
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 118321bd1a7f..db9201f45bf1 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1397,7 +1397,7 @@ static int f2fs_write_node_page(struct page *page,
struct f2fs_io_info fio = {
.sbi = sbi,
.type = NODE,
- .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
+ .rw = wbc_to_write(wbc),
.page = page,
.encrypted_page = NULL,
};
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index e137d96f1b17..ede87306caa5 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -37,8 +37,7 @@ static int gfs2_aspace_writepage(struct page *page, struct writeback_control *wb
{
struct buffer_head *bh, *head;
int nr_underway = 0;
- int write_op = REQ_META | REQ_PRIO |
- (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ int write_op = REQ_META | REQ_PRIO | wbc_to_write(wbc);
BUG_ON(!PageLocked(page));
BUG_ON(!page_has_buffers(page));
diff --git a/fs/mpage.c b/fs/mpage.c
index 6bd9fd90964e..9986c752f7bb 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -486,7 +486,6 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
struct buffer_head map_bh;
loff_t i_size = i_size_read(inode);
int ret = 0;
- int wr = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
if (page_has_buffers(page)) {
struct buffer_head *head = page_buffers(page);
@@ -595,7 +594,7 @@ page_is_mapped:
* This page will go to BIO. Do we need to send this BIO off first?
*/
if (bio && mpd->last_block_in_bio != blocks[0] - 1)
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write(wbc), bio);
alloc_new:
if (bio == NULL) {
@@ -622,7 +621,7 @@ alloc_new:
wbc_account_io(wbc, page, PAGE_SIZE);
length = first_unmapped << blkbits;
if (bio_add_page(bio, page, length, 0) < length) {
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write(wbc), bio);
goto alloc_new;
}
@@ -632,7 +631,7 @@ alloc_new:
set_page_writeback(page);
unlock_page(page);
if (boundary || (first_unmapped != blocks_per_page)) {
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write(wbc), bio);
if (boundary_block) {
write_boundary_block(boundary_bdev,
boundary_block, 1 << blkbits);
@@ -644,7 +643,7 @@ alloc_new:
confused:
if (bio)
- bio = mpage_bio_submit(wr, bio);
+ bio = mpage_bio_submit(wbc_to_write(wbc), bio);
if (mpd->use_writepage) {
ret = mapping->a_ops->writepage(page, wbc);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d445a64b979e..239a612ea1d6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -393,7 +393,7 @@ xfs_submit_ioend_bio(
atomic_inc(&ioend->io_remaining);
bio->bi_private = ioend;
bio->bi_end_io = xfs_end_bio;
- submit_bio(wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE, bio);
+ submit_bio(wbc_to_write(wbc), bio);
}
STATIC struct bio *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d0b5ca5d4e08..719c255e105a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -100,6 +100,14 @@ struct writeback_control {
#endif
};
+static inline int wbc_to_write(struct writeback_control *wbc)
+{
+ if (wbc->sync_mode == WB_SYNC_ALL)
+ return WRITE_SYNC;
+
+ return WRITE;
+}
+
/*
* A wb_domain represents a domain that wb's (bdi_writeback's) belong to
* and are measured against each other in. There always is one global
--
2.4.1.168.g1ea28e1
Hi,
Apparently I dropped the subject on this one; it's of course v2 of the
writeback not sucking patchset...
--
Jens Axboe
On 03/23/2016 09:39 AM, Jens Axboe wrote:
> Hi,
>
> Apparently I dropped the subject on this one; it's of course v2 of the
> writeback not sucking patchset...
Some test results. I've run a lot of them, on various types of storage,
and performance seems good with the default settings.
This case reads in a file and writes it to stdout. It targets a certain
latency for the reads - by default it's 10ms. If a read isn't done by
10ms, it'll queue the next read. This avoids the coordinated omission
problem, where one long latency is in fact many of them - you just don't
know it, since you don't issue more while one is stuck.
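In rough pseudo-C, the idea is something like the below. This is a
simplified, synchronous sketch, not the actual read-to-pipe-async source:
it just counts every 10ms deadline that passes while a read is stuck,
rather than queueing new reads, and it leaves out the writer side that
feeds gzip.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define TARGET_NSEC	(10 * 1000 * 1000LL)	/* 10ms latency target */

static long long ns(const struct timespec *t)
{
	return t->tv_sec * 1000000000LL + t->tv_nsec;
}

int main(int argc, char *argv[])
{
	char buf[65536];
	struct timespec now;
	long long deadline, missed = 0;
	ssize_t ret;
	int fd;

	if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &now);
	deadline = ns(&now) + TARGET_NSEC;

	do {
		ret = read(fd, buf, sizeof(buf));
		clock_gettime(CLOCK_MONOTONIC, &now);
		/*
		 * Account every deadline that passed while this read was
		 * outstanding, not just "one slow read". Otherwise a single
		 * multi-second stall hides all the 10ms misses behind it,
		 * which is the coordinated omission problem.
		 */
		while (ns(&now) > deadline) {
			missed++;
			deadline += TARGET_NSEC;
		}
		deadline = ns(&now) + TARGET_NSEC;
	} while (ret > 0);

	fprintf(stderr, "missed 10ms deadlines: %lld\n", missed);
	close(fd);
	return 0;
}

The real tool does this bookkeeping per read and per write and reports the
percentiles shown below; the sketch is only meant to show the deadline
accounting.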
The test case reads a compressed file, and writes it over a pipe to gzip
to decompress it. The input file is around 9G and uncompresses to 20G. At
the end of the run, latency results are shown. Every time the target
latency is exceeded during the run, it's output.
To keep the system busy, 75% (24G) of the memory is taken up by CPU
hogs. This is intended to make the case worse for the throttled depth,
as Dave pointed out.
Out-of-the-box results:
# time (./read-to-pipe-async -f randfile.gz | gzip -dc > outfile; sync)
read latency=11790 usec
read latency=82697 usec
[...]
Latency percentiles (usec) (READERS)
50.0000th: 4
75.0000th: 5
90.0000th: 6
95.0000th: 7
99.0000th: 54
99.5000th: 64
99.9000th: 334
99.9900th: 17952
99.9990th: 101504
99.9999th: 203520
Over=333, min=0, max=215367
Latency percentiles (usec) (WRITERS)
50.0000th: 3
75.0000th: 5
90.0000th: 454
95.0000th: 473
99.0000th: 615
99.5000th: 625
99.9000th: 815
99.9900th: 1142
99.9990th: 2244
99.9999th: 10032
Over=3, min=0, max=10811
Read rate (KB/sec) : 88988
Write rate (KB/sec): 60019
real 2m38.701s
user 2m33.030s
sys 1m31.540s
215ms worst case latency, 333 cases of being above the 10ms target. And
with the patchset applied:
# time (./read-to-pipe-async -f randfile.gz | gzip -dc > outfile; sync)
write latency=15394 usec
[...]
Latency percentiles (usec) (READERS)
50.0000th: 4
75.0000th: 5
90.0000th: 6
95.0000th: 8
99.0000th: 55
99.5000th: 64
99.9000th: 338
99.9900th: 2652
99.9990th: 3964
99.9999th: 7464
Over=1, min=0, max=10221
Latency percentiles (usec) (WRITERS)
50.0000th: 4
75.0000th: 5
90.0000th: 450
95.0000th: 471
99.0000th: 611
99.5000th: 623
99.9000th: 703
99.9900th: 1106
99.9990th: 2010
99.9999th: 10448
Over=6, min=1, max=15394
Read rate (KB/sec) : 95506
Write rate (KB/sec): 59970
real 2m39.014s
user 2m33.800s
sys 1m35.210s
I won't bore you with the vmstat output; it's pretty messy for the default case.
--
Jens Axboe