From: Jens Axboe
To: Dave Chinner
Subject: Re: [PATCHSET v3][RFC] Make background writeback not suck
Date: Thu, 31 Mar 2016 08:29:35 -0600
Message-ID: <56FD344F.70908@fb.com>
In-Reply-To: <20160331082433.GO11812@dastard>
References: <1459350477-16404-1-git-send-email-axboe@fb.com> <20160331082433.GO11812@dastard>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/31/2016 02:24 AM, Dave Chinner wrote:
> On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
>> Hi,
>>
>> This patchset isn't as much a final solution as it is a demonstration
>> of what I believe is a huge issue. Since the dawn of time, our
>> background buffered writeback has sucked. When we do background
>> buffered writeback, it should have little impact on foreground
>> activity. That's the definition of background activity... But for as
>> long as I can remember, heavy buffered writers have not behaved like
>> that. For instance, if I do something like this:
>>
>> $ dd if=/dev/zero of=foo bs=1M count=10k
>>
>> on my laptop and then try to start chrome, it basically won't start
>> before the buffered writeback is done. Or, for server oriented
>> workloads, installation of a big RPM (or similar) adversely impacts
>> database reads or sync writes. When that happens, I get people
>> yelling at me.
>>
>> Last time I posted this, I used flash storage as the example. But
>> this works equally well on rotating storage. Let's run a test case
>> that writes a lot. This test writes 50 files, each 100M, on XFS on
>> a regular hard drive. While this happens, we attempt to read
>> another file with fio.
>>
>> Writers:
>>
>> $ time (./write-files ; sync)
>> real    1m6.304s
>> user    0m0.020s
>> sys     0m12.210s
>
> Great. So a basic IO test looks good - let's throw something more
> complex at it. Say, a benchmark I've been using for years to stress
> the IO subsystem, the filesystem and memory reclaim all at the same
> time: a concurrent fsmark inode creation test.
> (first google hit https://lkml.org/lkml/2013/9/10/46)

Is that how you are invoking it as well, with the same arguments?
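For reference, the sort of invocation I'm assuming from that post is
roughly the below - the scratch directories, thread count, and
per-thread file counts are my guesses, not necessarily your exact
numbers:

$ fs_mark -S0 -s 0 -n 100000 -L 32 \
      -d /mnt/scratch/0 -d /mnt/scratch/1 \
      -d /mnt/scratch/2 -d /mnt/scratch/3 \
      -d /mnt/scratch/4 -d /mnt/scratch/5 \
      -d /mnt/scratch/6 -d /mnt/scratch/7 \
      -t 8

i.e. zero length files (-s 0), no syncing from fs_mark itself (-S0),
and one thread per scratch directory so the creates hit the
filesystem concurrently.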
> This generates thousands of REQ_WRITE metadata IOs every second, so
> if I understand how the throttle works correctly, these would be
> classified as background writeback by the block layer throttle.
> And....
>
>  FSUse%        Count         Size    Files/sec     App Overhead
>       0      1600000            0     255845.0         10796891
>       0      3200000            0     261348.8         10842349
>       0      4800000            0     249172.3         14121232
>       0      6400000            0     245172.8         12453759
>       0      8000000            0     201249.5         14293100
>       0      9600000            0     200417.5         29496551
>>>>>  0     11200000            0      90399.6         40665397
>       0     12800000            0     212265.6         21839031
>       0     14400000            0     206398.8         32598378
>       0     16000000            0     197589.7         26266552
>       0     17600000            0     206405.2         16447795
>>>>>  0     19200000            0      99189.6         87650540
>       0     20800000            0     249720.8         12294862
>       0     22400000            0     138523.8         47330007
>>>>>  0     24000000            0      85486.2         14271096
>       0     25600000            0     157538.1         64430611
>       0     27200000            0     109677.8         47835961
>       0     28800000            0     207230.5         31301031
>       0     30400000            0     188739.6         33750424
>       0     32000000            0     174197.9         41402526
>       0     33600000            0     139152.0        100838085
>       0     35200000            0     203729.7         34833764
>       0     36800000            0     228277.4         12459062
>>>>>  0     38400000            0      94962.0         30189182
>       0     40000000            0     166221.9         40564922
>>>>>  0     41600000            0      62902.5         80098461
>       0     43200000            0     217932.6         22539354
>       0     44800000            0     189594.6         24692209
>       0     46400000            0     137834.1         39822038
>       0     48000000            0     240043.8         12779453
>       0     49600000            0     176830.8         16604133
>       0     51200000            0     180771.8         32860221
>
> real    5m35.967s
> user    3m57.054s
> sys     48m53.332s
>
> In those highlighted report points, the performance has dropped
> significantly. The typical range I expect to see once memory has
> filled (a bit over 8m inodes) is 180k-220k files/sec. Runtime on a
> vanilla kernel was 4m40s and there were no performance drops, so
> this workload runs almost a minute slower with the block layer
> throttling code.
>
> What I see in these performance dips is the XFS transaction
> subsystem stalling *completely* - instead of running at a steady
> state of around 350,000 transactions/s, there are *zero*
> transactions running for periods of up to ten seconds. This
> coincides with the CPU usage falling to almost zero as well.
> AFAICT, the only thing that is running when the filesystem stalls
> like this is memory reclaim.

I'll take a look at this; stalls should definitely not be occurring.
How much memory does the box have?

> Without the block throttling patches, the workload quickly finds a
> steady state of around 7.5-8.5 million cached inodes, and it doesn't
> vary much outside those bounds. With the block throttling patches,
> on every transaction subsystem stall that occurs, the inode cache
> gets 3-4 million inodes trimmed out of it (i.e. half the cache),
> and in a couple of cases I saw it trim 6+ million inodes from the
> cache before the transactions started up and the cache started
> growing again.
>
>> The above was run without scsi-mq, using the deadline scheduler;
>> results with CFQ are similarly depressing for this test. So IO
>> scheduling is in place for this test, it's not pure blk-mq without
>> scheduling.
>
> virtio in guest, XFS direct IO -> no-op -> scsi in host.

That has write back caching enabled on the guest, correct?

-- 
Jens Axboe