From: Jens Axboe <axboe@fb.com>
Subject: [PATCHSET v3][RFC] Make background writeback not suck
Date: Wed, 30 Mar 2016 09:07:48 -0600
Message-ID: <1459350477-16404-1-git-send-email-axboe@fb.com>
List-ID: linux-kernel@vger.kernel.org

Hi,

This patchset isn't as much a final solution as it is a demonstration of
what I believe is a huge issue. Since the dawn of time, our background
buffered writeback has sucked. When we do background buffered writeback,
it should have little impact on foreground activity. That's the
definition of background activity... But for as long as I can remember,
heavy buffered writers have not behaved like that. For instance, if I do
something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop and then try to start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, installation of a big RPM (or similar) adversely impacts
database reads or sync writes. When that happens, I get people yelling
at me.

Last time I posted this, I used flash storage as the example. But this
works equally well on rotating storage. Let's run a test case that
writes a lot. This test writes 50 files, each 100M, on XFS on a regular
hard drive.
While this happens, we attempt to read another file with fio.

Writers:

$ time (./write-files ; sync)
real    1m6.304s
user    0m0.020s
sys     0m12.210s

Fio reader:

  read : io=35580KB, bw=550868B/s, iops=134, runt= 66139msec
    clat (usec): min=40, max=654204, avg=7432.37, stdev=43872.83
     lat (usec): min=40, max=654204, avg=7432.70, stdev=43872.83
    clat percentiles (usec):
     |  1.00th=[     41],  5.00th=[     41], 10.00th=[     41], 20.00th=[     42],
     | 30.00th=[     42], 40.00th=[     42], 50.00th=[     43], 60.00th=[     52],
     | 70.00th=[     59], 80.00th=[     65], 90.00th=[     87], 95.00th=[   1192],
     | 99.00th=[254976], 99.50th=[358400], 99.90th=[444416], 99.95th=[468992],
     | 99.99th=[651264]

Let's run the same test, but with the patches applied, and wb_percent
set to 10%:

Writers:

$ time (./write-files ; sync)
real    1m29.384s
user    0m0.040s
sys     0m10.810s

Fio reader:

  read : io=1024.0MB, bw=18640KB/s, iops=4660, runt= 56254msec
    clat (usec): min=39, max=408400, avg=212.05, stdev=2982.44
     lat (usec): min=39, max=408400, avg=212.30, stdev=2982.44
    clat percentiles (usec):
     |  1.00th=[    40],  5.00th=[    41], 10.00th=[    41], 20.00th=[    41],
     | 30.00th=[    42], 40.00th=[    42], 50.00th=[    42], 60.00th=[    42],
     | 70.00th=[    43], 80.00th=[    45], 90.00th=[    56], 95.00th=[    60],
     | 99.00th=[   454], 99.50th=[  8768], 99.90th=[ 36608], 99.95th=[ 43264],
     | 99.99th=[ 69120]

Much better, looking at the P99.x percentiles, and of course on the
bandwidth front as well. It's the difference between this:

 ---io---- -system-- ------cpu-----
  bi    bo   in   cs us sy id wa st
 20636 45056 5593 10833  0  0 94  6  0
 16416 46080 4484  8666  0  0 94  6  0
 16960 47104 5183  8936  0  0 94  6  0

and this:

 ---io---- -system-- ------cpu-----
  bi    bo   in   cs us sy id wa st
   384 73728  571   558  0  0 95  5  0
   384 73728  548   545  0  0 95  5  0
   388 73728  575   763  0  0 96  4  0

in the vmstat output. It's not quite as bad as on deeper queue depth
devices, where we have hugely bursty IO, but it's still very slow.
If we don't run the competing reader, the dirty data writeback proceeds
at normal rates:

# time (./write-files ; sync)
real    1m6.919s
user    0m0.010s
sys     0m10.900s

The above was run without scsi-mq, and using the deadline scheduler;
results with CFQ are similarly depressing for this test. So IO
scheduling is in place for this test, it's not pure blk-mq without
scheduling.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases that
need to clean memory at or near device speeds get to do that; we still
don't need thousands of requests to accomplish that. And for the cases
where we don't need to be near device limits, we can clean at a more
reasonable pace. See the last patch in the series for a more detailed
description of the change, and the tunable.

I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server. The
patchset is fully stable; I have not observed problems. It passes full
xfstest runs, and a variety of benchmarks as well. It works equally well
on blk-mq/scsi-mq and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches.
Patches are against current Linus' git (4.6.0-rc1). I can make them
available against 4.5 as well, if there's any interest in that for test
purposes.

Changes since v2

- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change between
  wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1

- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in the
  writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
  directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups

 block/Makefile                   |    2
 block/blk-core.c                 |   15 ++
 block/blk-mq.c                   |   31 ++++-
 block/blk-settings.c             |   20 +++
 block/blk-sysfs.c                |  128 ++++++++++++++++++++
 block/blk-wb.c                   |  238 +++++++++++++++++++++++++++++++++++++++
 block/blk-wb.h                   |   33 +++++
 drivers/nvme/host/core.c         |    1
 drivers/scsi/scsi.c              |    3
 drivers/scsi/sd.c                |    5
 fs/block_dev.c                   |    2
 fs/buffer.c                      |    2
 fs/f2fs/data.c                   |    2
 fs/f2fs/node.c                   |    2
 fs/fs-writeback.c                |   13 ++
 fs/gfs2/meta_io.c                |    3
 fs/mpage.c                       |    9 -
 fs/xfs/xfs_aops.c                |    2
 include/linux/backing-dev-defs.h |    2
 include/linux/blk_types.h        |    2
 include/linux/blkdev.h           |   18 ++
 include/linux/writeback.h        |    8 +
 mm/page-writeback.c              |    2
 23 files changed, 527 insertions(+), 16 deletions(-)

-- 
Jens Axboe