Subject: Re: [PATCHSET][RFC] Make background writeback not suck
To: Dave Chinner
From: Jens Axboe
Date: Tue, 22 Mar 2016 16:57:09 -0600
Message-ID: <56F1CDC5.9010006@fb.com>
In-Reply-To: <20160322223156.GU11812@dastard>
References: <1458669320-6819-1-git-send-email-axboe@fb.com>
 <20160322215122.GS11812@dastard> <56F1C130.8020200@fb.com>
 <20160322223156.GU11812@dastard>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/22/2016 04:31 PM, Dave Chinner wrote:
> On Tue, Mar 22, 2016 at 04:03:28PM -0600, Jens Axboe wrote:
>> On 03/22/2016 03:51 PM, Dave Chinner wrote:
>>> On Tue, Mar 22, 2016 at 11:55:14AM -0600, Jens Axboe wrote:
>>>> This patchset isn't as much a final solution, as it's a demonstration
>>>> of what I believe is a huge issue. Since the dawn of time, our
>>>> background buffered writeback has sucked. When we do background
>>>> buffered writeback, it should have little impact on foreground
>>>> activity. That's the definition of background activity... But for as
>>>> long as I can remember, heavy buffered writers have not behaved like
>>>> that.
>>>
>>> Of course not.
>>> The IO scheduler is supposed to determine how we
>>> meter out bulk vs latency sensitive IO that is queued. That's what
>>> all the things like anticipatory scheduling for read requests was
>>> supposed to address....
>>>
>>> I'm guessing you're seeing problems like this because blk-mq has no
>>> IO scheduler infrastructure and so no way of prioritising,
>>> scheduling and/or throttling different types of IO? Would that be
>>> accurate?
>>
>> It's not just that, but obviously the IO scheduler would be one
>> place to throttle it. This, in a way, is a way of scheduling the
>> writeback writes better. But most of the reports I get on writeback
>> sucking is not using scsi/blk-mq, they end up being "classic" on
>> things like deadline.
>
> Deadline doesn't have anticipatory read scheduling, right?
>
> Really, I'm just trying to understand why this isn't being added as
> part of the IO scheduler infrastructure, but is instead adding
> another layer of non-optional IO scheduling to the block layer...

This is less about IO scheduling than it is about making sure that
background writeback isn't too intrusive. Yes, you can argue that that
is a form of IO scheduling. But so is the rate detection, for instance,
and some of the other limits we put in. But we don't properly limit
depth, and I think we should.

Anticipatory read scheduling is a completely different animal, that is
centered around avoiding long seeks on rotational media, assuming that
there's locality on disk for related (and back-to-back) reads.

This can easily be part of IO scheduling infrastructure, it already
somewhat is, given where it's placed.

>>>> The read starts out fine, but goes to shit when we start background
>>>> flushing. The reader experiences latency spikes in the seconds range.
>>>> On flash.
>>>>
>>>> With this set of patches applied, the situation looks like this instead:
>>>>
>>>> -----io----- --system--- ------cpu-----
>>>>    bi     bo    in    cs us sy id wa st
>>>> 33544      0  8650 17204  0  1 97  2  0
>>>> 42488      0 10856 21756  0  0 97  3  0
>>>> 42032      0 10719 21384  0  0 97  3  0
>>>> 42544     12 10838 21631  0  0 97  3  0
>>>> 42620      0 10982 21727  0  3 95  3  0
>>>> 46392      0 11923 23597  0  3 94  3  0
>>>> 36268 512000  9907 20044  0  3 91  5  0
>>>> 31572 696324  8840 18248  0  1 91  7  0
>>>> 30748 626692  8617 17636  0  2 91  6  0
>>>> 31016 618504  8679 17736  0  3 91  6  0
>>>> 30612 648196  8625 17624  0  3 91  6  0
>>>> 30992 650296  8738 17859  0  3 91  6  0
>>>> 30680 604075  8614 17605  0  3 92  6  0
>>>> 30592 595040  8572 17564  0  2 92  6  0
>>>> 31836 539656  8819 17962  0  2 92  5  0
>>>
>>> And now it runs at ~600MB/s, slowing down the rate at which memory
>>> is cleaned by 60%.
>>
>> Which is the point, correct... If we're not anywhere near being
>> tight on memory AND nobody is waiting for this IO, then by
>> definition, the foreground activity is the important one. For the
>> case used here, that's the application doing reads.
>
> Unless, of course, we are in a situation where there is also large
> memory demand, and we need to clean memory fast....

If that's the case, then we should ramp up limits. The important part
here is that we currently issue thousands of requests. Literally
thousands. Do we need thousands to get optimal throughput? No. In fact
it's detrimental to other system activity, without providing any
benefits. The idea here is to throttle us within a given window of
depth, let's call that 1..N, where N is enough to get us full write
performance. If we're doing pure background writes, QD is well less
than N. If the urgency increases, we'll ramp it up. But for all cases,
we're avoiding the situation of having several thousand requests in
flight.
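
To make the 1..N window a bit more concrete, here is a toy model in
Python. It is only an illustration of the idea described above, not
code from the patchset: the urgency levels, the DepthWindow name and
the 1/4 and 1/2 scaling factors are all made-up assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    # Hypothetical urgency levels, chosen for illustration only.
    BACKGROUND = 0  # pure background writeback, nobody is waiting
    BLOCKED = 1     # a task is blocked on dirtying memory
    CRITICAL = 2    # memory must be cleaned as fast as possible


@dataclass
class DepthWindow:
    # "N" from the text: assumed to be the queue depth that is enough
    # to reach full write performance on the device.
    max_depth: int

    def limit(self, urgency: Urgency) -> int:
        """Return the allowed number of in-flight writeback requests."""
        if urgency is Urgency.BACKGROUND:
            # Assumed policy: well below N, but never less than 1.
            return max(1, self.max_depth // 4)
        if urgency is Urgency.BLOCKED:
            # Assumed policy: ramp up when someone is waiting.
            return max(1, self.max_depth // 2)
        # Highest urgency: allow the full window, but never more than N.
        return self.max_depth

    def may_issue(self, inflight: int, urgency: Urgency) -> bool:
        """True if one more writeback request may be queued right now."""
        return inflight < self.limit(urgency)
```

Whatever the urgency, the limit never exceeds N, so the queue never
fills up with thousands of writeback requests.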

>>> Given that background writeback is relied on by memory reclaim to
>>> clean memory faster than the LRUs are cycled, I suspect this is
>>> going to have a big impact on low memory behaviour and balance,
>>> which will then feed into IO breakdown problems caused by writeback
>>> being driven from the LRUs rather than the flusher threads.....
>>
>> You're missing the part where the intent is to only throttle it
>> heavily when it's pure background writeback. Of course, if we are
>> low on memory and doing reclaim, we should get much closer to device
>> bandwidth.
>
> A demonstration, please.
>
> I didn't see anything in the code that treated low memory conditions
> differently - that just uses
> wakeup_flusher_threads(WB_REASON_TRY_TO_FREE_PAGES) from
> do_try_to_free_pages() to trigger background writeback to run and
> clean pages, so I'm interested to see exactly how that works out...

It's not all there yet, this is an early RFC. The case that is handled
is the application being blocked on dirtying memory, we increase
bandwidth for that case. I'll look into reclaim next, honestly don't
see this as something that's hard to handle.

>> If I run the above dd without the reader running, I'm already at 90%
>> of the device bandwidth - not quite all the way there, since I still
>> want to quickly be able to inject reads (or other IO) without having
>> to wait for the queues to purge thousands of requests.
>
> So, essentially, the model is to run background write at "near
> starvation" queue depths, which works fine when the system is mostly
> idle and we can dispatch more IO immediately. My concern with this
> model is that under heavy IO and CPU load, writeback dispatch often
> has significant delays (e.g. for allocation, etc). This is when we
> need deeper queue depths to maintain throughput across dispatch
> latency variations.

Not near starvation. For most devices, it doesn't take a lot of writes
to get very close to max performance.
I don't want to starve the device, that's not the intention here. And
if writeback kworkers being CPU starved is a concern, then yes, of
course that needs to be handled appropriately. If we're in or near a
critical condition, then writeback must proceed swiftly. For the
general use case of that NOT being the case, then we can limit
writeback a bit and get much better system behavior for most
applications and users.

> Many production workloads don't care about read latency, but do care
> about bulk page cache throughput. Such workloads are going to be
> adversely affected by a fundamental block layer IO dispatch model
> change like this. This is why we have the pluggable IO schedulers in
> the first place - one size does not fit all.

I'd counter that most production workloads DO care about IO latencies,
be it reads or writes. In fact it's often the single most important
thing that people complain about. When provisioning hardware or setups,
we're often in the situation that we have to drive things softer than
we would otherwise want to, to get more predictable behavior. The
current writeback is the opposite of predictable, unless you consider
the fact that it predictably shits itself with basic operations like
buffered writes.

One example is a user deliberately dirtying at X MB/sec, where X is
well below what the device can handle. That user now has two choices:

1) Do nothing, and incur the wrath of periodic writeback that issues a
   ton of requests, disturbing everything else.

2) Insert periodic sync points to avoid queuing up too much. The user
   needs to be able to block to do that, so that means offloading to a
   thread.

Neither of those are appealing choices. Wouldn't it be great if the
user didn't have to do anything, and have the system behave in a
reasonable manner? I think so.
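
For reference, choice 2 looks roughly like the Python sketch below. It
is an illustration only: the function name, the default chunk size and
the sync interval are arbitrary assumptions, and os.fsync() stands in
for whatever sync call the application would actually use.

```python
import os
import threading
import time


def write_with_sync_points(path, total_bytes, chunk_bytes=64 * 1024,
                           rate_bytes_per_s=None, sync_interval_s=0.05):
    """Buffered writer plus a helper thread that issues periodic fsync().

    Returns the number of periodic fsync() calls the helper thread made.
    """
    stop = threading.Event()
    sync_count = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

    def syncer():
        nonlocal sync_count
        # Event.wait() returns False on timeout and True once stop is set.
        while not stop.wait(sync_interval_s):
            os.fsync(fd)  # blocks the helper thread, not the writer
            sync_count += 1

    helper = threading.Thread(target=syncer)
    helper.start()
    try:
        chunk = b"\0" * chunk_bytes
        written = 0
        while written < total_bytes:
            want = min(chunk_bytes, total_bytes - written)
            done = os.write(fd, chunk[:want])
            written += done
            if rate_bytes_per_s:
                # Crude pacing so we dirty at roughly X bytes/sec.
                time.sleep(done / rate_bytes_per_s)
    finally:
        stop.set()
        helper.join()
        os.fsync(fd)  # final sync so nothing is left queued up
        os.close(fd)
    return sync_count
```

That is an extra thread and extra tuning knobs in every application
that does paced buffered writes, which is the burden described above.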

> Hence I'm thinking that this should not be applied to all block
> devices as this patch does, but instead be a part of the io
> scheduling infrastructure we already have (and need for blk-mq).

Sure, that's a placement concern, and that's something that can be
worked on. I feel like we're arguing in circles here.

-- 
Jens Axboe