From: Jens Axboe <axboe@fb.com>
Subject: [PATCHSET v3][RFC] Make background writeback not suck
Date: Wed, 30 Mar 2016 09:07:48 -0600
Message-ID: <1459350477-16404-1-git-send-email-axboe@fb.com>
List-ID: linux-kernel@vger.kernel.org

Hi,

This patchset isn't as much a final solution as it is a demonstration of
what I believe is a huge issue. Since the dawn of time, our background
buffered writeback has sucked. When we do background buffered writeback,
it should have little impact on foreground activity. That's the
definition of background activity... But for as long as I can remember,
heavy buffered writers have not behaved like that. For instance, if I do
something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop and then try to start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, installation of a big RPM (or similar) adversely impacts
database reads or sync writes. When that happens, I get people yelling
at me.

Last time I posted this, I used flash storage as the example. But this
works equally well on rotating storage. Let's run a test case that
writes a lot. This test writes 50 files, each 100M, on XFS on a regular
hard drive.
While this happens, we attempt to read another file with fio.

Writers:

$ time (./write-files ; sync)
real    1m6.304s
user    0m0.020s
sys     0m12.210s

Fio reader:

  read : io=35580KB, bw=550868B/s, iops=134, runt= 66139msec
    clat (usec): min=40, max=654204, avg=7432.37, stdev=43872.83
     lat (usec): min=40, max=654204, avg=7432.70, stdev=43872.83
    clat percentiles (usec):
     |  1.00th=[     41],  5.00th=[     41], 10.00th=[     41], 20.00th=[     42],
     | 30.00th=[     42], 40.00th=[     42], 50.00th=[     43], 60.00th=[     52],
     | 70.00th=[     59], 80.00th=[     65], 90.00th=[     87], 95.00th=[   1192],
     | 99.00th=[254976], 99.50th=[358400], 99.90th=[444416], 99.95th=[468992],
     | 99.99th=[651264]

Let's run the same test, but with the patches applied, and wb_percent
set to 10%:

Writers:

$ time (./write-files ; sync)
real    1m29.384s
user    0m0.040s
sys     0m10.810s

Fio reader:

  read : io=1024.0MB, bw=18640KB/s, iops=4660, runt= 56254msec
    clat (usec): min=39, max=408400, avg=212.05, stdev=2982.44
     lat (usec): min=39, max=408400, avg=212.30, stdev=2982.44
    clat percentiles (usec):
     |  1.00th=[    40],  5.00th=[    41], 10.00th=[    41], 20.00th=[    41],
     | 30.00th=[    42], 40.00th=[    42], 50.00th=[    42], 60.00th=[    42],
     | 70.00th=[    43], 80.00th=[    45], 90.00th=[    56], 95.00th=[    60],
     | 99.00th=[   454], 99.50th=[  8768], 99.90th=[ 36608], 99.95th=[ 43264],
     | 99.99th=[ 69120]

Much better, looking at the P99.x percentiles, and of course on the
bandwidth front as well. It's the difference between this:

 ---io---- -system-- ------cpu-----
  bi    bo   in   cs us sy id wa st
 20636 45056 5593 10833  0  0 94  6  0
 16416 46080 4484  8666  0  0 94  6  0
 16960 47104 5183  8936  0  0 94  6  0

and this:

 ---io---- -system-- ------cpu-----
  bi    bo   in   cs us sy id wa st
   384 73728  571   558  0  0 95  5  0
   384 73728  548   545  0  0 95  5  0
   388 73728  575   763  0  0 96  4  0

in the vmstat output. It's not quite as bad as on deeper queue depth
devices, where we have hugely bursty IO, but it's still very slow.
If we don't run the competing reader, the dirty data writeback proceeds
at normal rates:

# time (./write-files ; sync)
real    1m6.919s
user    0m0.010s
sys     0m10.900s

The above was run without scsi-mq, and using the deadline scheduler;
results with CFQ are similarly depressing for this test. So IO
scheduling is in place for this test, it's not pure blk-mq without
scheduling.

The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.

This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases that
need to clean memory at or near device speeds get to do that; we still
don't need thousands of requests to accomplish that. And for the cases
where we don't need to be near device limits, we can clean at a more
reasonable pace. See the last patch in the series for a more detailed
description of the change, and the tunable.

I welcome testing. If you are sick of Linux bogging down when buffered
writes are happening, then this is for you, laptop or server. The
patchset is fully stable; I have not observed problems. It passes full
xfstest runs, and a variety of benchmarks as well. It works equally well
on blk-mq/scsi-mq and "classic" setups.

You can also find this in a branch in the block git repo:

git://git.kernel.dk/linux-block.git wb-buf-throttle

Note that I rebase this branch when I collapse patches.
Patches are against current Linus' git (4.6.0-rc1). I can make them
available against 4.5 as well, if there's any interest in that for test
purposes.

Changes since v2

- Switch from wb_depth to wb_percent, as that's an easier tunable.
- Add the patch to track device depth on the block layer side.
- Cleanup the limiting code.
- Don't use a fixed limit in the wb wait, since it can change between
  wakeups.
- Minor tweaks, fixups, cleanups.

Changes since v1

- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in the
  writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
  directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups

 block/Makefile                   |    2
 block/blk-core.c                 |   15 ++
 block/blk-mq.c                   |   31 ++++-
 block/blk-settings.c             |   20 +++
 block/blk-sysfs.c                |  128 ++++++++++++++++++++
 block/blk-wb.c                   |  238 +++++++++++++++++++++++++++++++++++++++
 block/blk-wb.h                   |   33 +++++
 drivers/nvme/host/core.c         |    1
 drivers/scsi/scsi.c              |    3
 drivers/scsi/sd.c                |    5
 fs/block_dev.c                   |    2
 fs/buffer.c                      |    2
 fs/f2fs/data.c                   |    2
 fs/f2fs/node.c                   |    2
 fs/fs-writeback.c                |   13 ++
 fs/gfs2/meta_io.c                |    3
 fs/mpage.c                       |    9 -
 fs/xfs/xfs_aops.c                |    2
 include/linux/backing-dev-defs.h |    2
 include/linux/blk_types.h        |    2
 include/linux/blkdev.h           |   18 ++
 include/linux/writeback.h        |    8 +
 mm/page-writeback.c              |    2
 23 files changed, 527 insertions(+), 16 deletions(-)

-- 
Jens Axboe