Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759974AbXHEIEy (ORCPT ); Sun, 5 Aug 2007 04:04:54 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1755427AbXHEIEf (ORCPT ); Sun, 5 Aug 2007 04:04:35 -0400 Received: from dsl081-085-152.lax1.dsl.speakeasy.net ([64.81.85.152]:55631 "EHLO moonbase.phunq.net" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1753799AbXHEIE2 (ORCPT ); Sun, 5 Aug 2007 04:04:28 -0400 From: Daniel Phillips To: Evgeniy Polyakov Subject: Re: Distributed storage. Date: Sun, 5 Aug 2007 01:04:19 -0700 User-Agent: KMail/1.9.5 Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Peter Zijlstra References: <20070731171347.GA14267@2ka.mipt.ru> <200708031819.17039.phillips@phunq.net> <20070804163740.GB14175@2ka.mipt.ru> In-Reply-To: <20070804163740.GB14175@2ka.mipt.ru> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <200708050104.19596.phillips@phunq.net> Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6265 Lines: 112 On Saturday 04 August 2007 09:37, Evgeniy Polyakov wrote: > On Fri, Aug 03, 2007 at 06:19:16PM -0700, I wrote: > > To be sure, I am not very proud of this throttling mechanism for > > various reasons, but the thing is, _any_ throttling mechanism no > > matter how sucky solves the deadlock problem. Over time I want to > > move the > > make_request_fn is always called in process context, Yes, as is submit_bio which calls it. The decision re where it is best to throttle, in submit_bio or in make_request_fn, has more to do with system factoring, that is, is throttling something that _every_ block device should have (yes I think) or is it a delicate, optional thing that needs a tweakable algorithm per block device type (no I think). The big worry I had was that by blocking on congestion in the submit_bio/make_request_fn I might stuff up system-wide mm writeout. But a while ago that part of the mm was tweaked (by Andrew if I recall correctly) to use a pool of writeout threads and understand the concept of one of them blocking on some block device, and not submit more writeout to the same block device until the first thread finishes its submission. Meanwhile, other mm writeout threads carry on with other block devices. > we can wait in it for memory in mempool. Although that means we > already in trouble. Not at all. This whole block writeout path needs to be written to run efficiently even when normal system memory is completely gone. All it means when we wait on a mempool is that the block device queue is as full as we are ever going to let it become, and that means the block device is working as hard as it can (subject to a small caveat: for some loads a device can work more efficiently if it can queue up larger numbers of requests down at the physical elevators). By the way, ddsnap waits on a counting semaphore, not a mempool. That is because we draw our reserve memory from the global memalloc reserve, not from a mempool. And that is not only because it takes less code to do so, but mainly because global pools as opposed to lots of little special purpose pools seem like a good idea to me. Though I will admit that with our current scheme we need to allow for the total of the maximum reserve requirements for all memalloc users in the memalloc pool, so it does not actually save any memory vs dedicated pools. We could improve that if we wanted to, by having hard and soft reserve requirements: the global reserve actually only needs to be as big as the total of the hard requirements. With this idea, if by some unlucky accident every single pool user got itself maxed out at the same time, we would still not exceed our share of the global reserve. Under "normal" low memory situations, a block device would typically be free to grab reserve memory up to its soft limit, allowing it to optimize over a wider range of queued transactions. My little idea here is: allocating specific pages to a pool is kind of dumb, all we really want to do is account precisely for the number of pages we are allowed to draw from the global reserve. OK, I kind of digressed, but this all counts as explaining the details of what Peter and I have been up to for the last year (longer for me). At this point, we don't need to do the reserve accounting in the most absolutely perfect way possible, we just need to get something minimal in place to fix the current deadlock problems, then we can iteratively improve it. > I agree, any kind of high-boundary leveling must be implemented in > device itself, since block layer does not know what device is at the > end and what it will need to process given block request. I did not say the throttling has to be implemented in the device, only that we did it there because it was easiest to code that up and try it out (it worked). This throttling really wants to live at a higher level, possibly submit_bio()...bio->endio(). Someone at OLS (James Bottomley?) suggested it would be better done at the request queue layer, but I do not immediately see why that should be. I guess this is going to come down to somebody throwing out a patch for interested folks to poke at. But this detail is a fine point. The big point is to have _some_ throttling mechanism in place on the block IO path, always. Device mapper in particular does not have any throttling itself: calling submit_bio on a device mapper device directly calls the device mapper bio dispatcher. Default initialized block device queue do provide a crude form of throttling based on limiting the number of requests. This is insufficiently precise to do a good job in the long run, but it works for now because the current gaggle of low level block drivers do not have a lot of resource requirements and tend to behave fairly predictably (except for some irritating issues re very slow devices working in parallel with very fast devices, but... worry about that later). Network block drivers - for example your driver - do have nontrivial resource requirements and they do not, as far as I can see, have any form of throttling on the upstream side. So unless you can demonstrate I'm wrong (I would be more than happy about that) then we are going to need to add some. Anyway, I digressed again. _Every_ layer in a block IO stack needs to have a reserve, if it consumes memory. So we can't escape the question of how big to make those reserves by trying to push it all down to the lowest level, hoping that the low level device knows more about how many requests it will have in flight. For the time being, we will just plug in some seat of the pants numbers in classic Linux fashion and that will serve us until somebody gets around to inventing the one true path discovery mechanism that can sniff around in the block IO stack and figure out the optimal amount of system resources to reserve at each level, which ought to be worth at least a master's thesis for somebody. Regards, Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/