2007-08-02 21:09:23

by Daniel Phillips

Subject: Re: Distributed storage.

On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> Hi.
>
> I'm pleased to announce first release of the distributed storage
> subsystem, which allows to form a storage on top of remote and local
> nodes, which in turn can be exported to another storage as a node to
> form tree-like storages.

Excellent! This is precisely what the doctor ordered for the
OCFS2-based distributed storage system I have been mumbling about for
some time. In fact the dd in ddsnap and ddraid stands for "distributed
data". The ddsnap/raid devices do not include an actual network
transport, that is expected to be provided by a specialized block
device, which up till now has been NBD. But NBD has various
deficiencies as you note, in addition to its tendency to deadlock when
accessed locally. Your new code base may be just the thing we always
wanted. We (zumastor et al) will take it for a drive and see if
anything breaks.

Memory deadlock is a concern of course. From a cursory glance through,
it looks like this code is pretty vm-friendly and you have thought
quite a lot about it, however I respectfully invite peterz
(obsessive/compulsive memory deadlock hunter) to help give it a good
going over with me.

I see bits that worry me, e.g.:

+ req = mempool_alloc(st->w->req_pool, GFP_NOIO);

which seems to be callable in response to a local request, just the case
where NBD deadlocks. Your mempool strategy can work reliably only if
you can prove that the pool allocations of the maximum number of
requests you can have in flight do not exceed the size of the pool. In
other words, if you ever take the pool's fallback path to normal
allocation, you risk deadlock.
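
Roughly, the sizing rule looks like this (a sketch only; MAX_INFLIGHT_REQS
and the helpers are placeholders, not names from your code):

#include <linux/mempool.h>
#include <linux/gfp.h>
#include <linux/errno.h>

/* The pool must cover the worst case up front: anything beyond this
 * bound relies on the normal allocator, which is exactly the path that
 * can deadlock under writeout. */
#define MAX_INFLIGHT_REQS 256	/* >= the true bound on in-flight requests */

static mempool_t *req_pool;

static int req_pool_init(size_t req_size)
{
	req_pool = mempool_create_kmalloc_pool(MAX_INFLIGHT_REQS, req_size);
	return req_pool ? 0 : -ENOMEM;
}

/* On the IO path this makes progress even when the normal allocator has
 * nothing to give, provided callers never exceed MAX_INFLIGHT_REQS. */
static void *req_get(void)
{
	return mempool_alloc(req_pool, GFP_NOIO);
}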

Anyway, if this is as grand as it seems then I would think we ought to
factor out a common transfer core that can be used by all of NBD,
iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own
code those things have now.

Regards,

Daniel


2007-08-03 10:30:08

by Evgeniy Polyakov

Subject: Re: Distributed storage.

On Thu, Aug 02, 2007 at 02:08:24PM -0700, Daniel Phillips ([email protected]) wrote:
> On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> > Hi.
> >
> > I'm pleased to announce first release of the distributed storage
> > subsystem, which allows to form a storage on top of remote and local
> > nodes, which in turn can be exported to another storage as a node to
> > form tree-like storages.
>
> Excellent! This is precisely what the doctor ordered for the
> OCFS2-based distributed storage system I have been mumbling about for
> some time. In fact the dd in ddsnap and ddraid stands for "distributed
> data". The ddsnap/raid devices do not include an actual network
> transport, that is expected to be provided by a specialized block
> device, which up till now has been NBD. But NBD has various
> deficiencies as you note, in addition to its tendency to deadlock when
> accessed locally. Your new code base may be just the thing we always
> wanted. We (zumastor et al) will take it for a drive and see if
> anything breaks.

That would be great.

> Memory deadlock is a concern of course. From a cursory glance through,
> it looks like this code is pretty vm-friendly and you have thought
> quite a lot about it, however I respectfully invite peterz
> (obsessive/compulsive memory deadlock hunter) to help give it a good
> going over with me.
>
> I see bits that worry me, e.g.:
>
> + req = mempool_alloc(st->w->req_pool, GFP_NOIO);
>
> which seems to be callable in response to a local request, just the case
> where NBD deadlocks. Your mempool strategy can work reliably only if
> you can prove that the pool allocations of the maximum number of
> requests you can have in flight do not exceed the size of the pool. In
> other words, if you ever take the pool's fallback path to normal
> allocation, you risk deadlock.

The mempool should be allocated so that it can keep up with the maximum
number of in-flight requests; in my tests I was unable to force the
block layer to put more than 31 pages in flight for a sync, and those
were in a single bio. Each request is essentially delayed bio
processing, so this must handle the maximum number of in-flight bios
(if they do not cover multiple nodes; if they do, then each node
requires its own request). A sync has one bio in flight on my machines
(from tiny VIA nodes to low-end amd64), and the number of normal
requests *usually* does not exceed several dozen (always less than a
hundred), but that might just be my small systems, so the request size
was selected to be as small as possible and the number of allocations
reduced to the absolute minimum.

> Anyway, if this is as grand as it seems then I would think we ought to
> factor out a common transfer core that can be used by all of NBD,
> iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own
> code those things have now.
>
> Regards,
>
> Daniel

Thanks.

--
Evgeniy Polyakov

2007-08-03 10:58:21

by Evgeniy Polyakov

Subject: Re: Distributed storage.

On Fri, Aug 03, 2007 at 02:26:29PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> > Memory deadlock is a concern of course. From a cursory glance through,
> > it looks like this code is pretty vm-friendly and you have thought
> > quite a lot about it, however I respectfully invite peterz
> > (obsessive/compulsive memory deadlock hunter) to help give it a good
> > going over with me.

Another major issue is network allocations.

I originally opposed your initial work and the subsequent releases made
by Peter, but now I think the right way is to combine the strong points
of your approach with a specialized allocator. Essentially, what I
proposed (in my blog only, though) is to bind an independent reserve to
any socket - such a reserve can be stolen from the socket buffer itself
(each socket has a limited socket buffer from which packets are
allocated, accounting both data and control (skb) lengths), so that
when the main allocation via the common path fails, it is still
possible to get memory from the socket's own reserve. This allows
sending sockets to make progress in case of deadlock.

For receiving, the situation is worse, since the system does not know
in advance which socket a given packet will belong to, so it must
allocate from a global pool (and thus there must be an independent
global reserve), and then exchange part of the socket's reserve with
the global one (or just copy the packet into a new one allocated from
the socket's reserve if it was set up, or drop it otherwise). An
independent global reserve is what I proposed when I stopped
advertising the network allocator, but it seems that was not taken into
account: in Peter's patches the reserve was only ever allocated when
the system was already under serious memory pressure, with no provision
for per-socket reservation.

This separates sockets and effectively makes them fair - a system
administrator or programmer can trim a socket's buffer a bit and
request a reserve for special communication channels, which will then
have a guaranteed ability to make both sending and receiving progress,
no matter how many of them are set up. And it does not require any
changes outside the network side.

--
Evgeniy Polyakov

2007-08-03 12:28:42

by Peter Zijlstra

Subject: Re: Distributed storage.

On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote:

> For receiving situation is worse, since system does not know in advance
> to which socket given packet will belong to, so it must allocate from
> global pool (and thus there must be independent global reserve), and
> then exchange part of the socket's reserve to the global one (or just
> copy packet to the new one, allocated from socket's reserve if it was
> setup, or drop it otherwise). Global independent reserve is what I
> proposed when stopped to advertise network allocator, but it seems that
> it was not taken into account, and reserve was always allocated only
> when system has serious memory pressure in Peter's patches without any
> meaning for per-socket reservation.

This is not true. I have a global reserve which is set up a priori. You
cannot allocate a reserve when under pressure; that does not make sense.

Let me explain my approach once again.

At swapon(8) time we allocate a global reserve and associate the needed
sockets with it. The size of this global reserve is made up of two
parts:
- TX
- RX

The RX pool is the most interesting part. It again is made up of two
parts:
- skb
- auxiliary data

The skb part is scaled such that it can overflow the IP fragment
reassembly; the aux pool such that it can overflow the route cache (that
was the largest other allocator in the RX path).

All (reserve) RX skb allocations are accounted, so as to never allocate
more than we reserved.

All packets are received (given the limit) and are processed up to
socket demux. At that point all packets not targeted at an associated
socket are dropped and the skb memory freed - ready for another packet.

All packets targeted at associated sockets get processed. This requires
that this packet processing happens in-kernel, because while we are
swapping, user-space might be waiting for this data and we'd deadlock.
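
In rough code the demux check amounts to this (hypothetical helper
names, just to illustrate the rule; the caller drops the skb when it
returns false so the reserve memory is immediately available again):

#include <linux/skbuff.h>
#include <net/sock.h>

/* Sketch: packets fed from the emergency reserve may only reach sockets
 * associated with that reserve; everything else gets dropped. */
static bool reserve_may_process(const struct sk_buff *skb,
				const struct sock *sk)
{
	if (!skb_from_reserve(skb))	/* normal memory: no restriction */
		return true;
	return sk_is_memalloc(sk);	/* reserve-fed: associated sockets only */
}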


I'm not quite sure why you need per socket reservations.

2007-08-03 13:50:30

by Evgeniy Polyakov

Subject: Re: Distributed storage.

On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra ([email protected]) wrote:
> On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote:
>
> > For receiving situation is worse, since system does not know in advance
> > to which socket given packet will belong to, so it must allocate from
> > global pool (and thus there must be independent global reserve), and
> > then exchange part of the socket's reserve to the global one (or just
> > copy packet to the new one, allocated from socket's reserve if it was
> > setup, or drop it otherwise). Global independent reserve is what I
> > proposed when stopped to advertise network allocator, but it seems that
> > it was not taken into account, and reserve was always allocated only
> > when system has serious memory pressure in Peter's patches without any
> > meaning for per-socket reservation.
>
> This is not true. I have a global reserve which is set-up a priori. You
> cannot allocate a reserve when under pressure, that does not make sense.

I probably cut out too many details - my main position is to allocate a
per-socket reserve from the socket's queue and copy data there from the
main reserve, all of which are allocated either in advance (the global
one) or via a socket option, so that there are no fairness issues about
what to mark as special and what not to.

Say we have a page per socket; each socket can assign a reserve for
itself from its own memory, and this accounts for both the tx and rx
sides. Tx is not interesting, it is simple; rx has a global reserve
(always allocated at startup, or at least well before reclaim/OOM)
where data is originally received (including the skb, shared info and
whatever else is needed - a page is just an example), and it is then
copied into the per-socket reserve and the global one is reused for the
next packet. Having a per-socket reserve allows progress in any
situation, not only in cases where a single action must be
received/processed, and it is completely fair to all users rather than
just special sockets, so the admin, for example, would be allowed to
log in, IPsec would work, and so on...

--
Evgeniy Polyakov

2007-08-03 14:53:52

by Peter Zijlstra

Subject: Re: Distributed storage.

On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote:
> On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra ([email protected]) wrote:
> > On Fri, 2007-08-03 at 14:57 +0400, Evgeniy Polyakov wrote:
> >
> > > For receiving situation is worse, since system does not know in advance
> > > to which socket given packet will belong to, so it must allocate from
> > > global pool (and thus there must be independent global reserve), and
> > > then exchange part of the socket's reserve to the global one (or just
> > > copy packet to the new one, allocated from socket's reserve if it was
> > > setup, or drop it otherwise). Global independent reserve is what I
> > > proposed when stopped to advertise network allocator, but it seems that
> > > it was not taken into account, and reserve was always allocated only
> > > when system has serious memory pressure in Peter's patches without any
> > > meaning for per-socket reservation.
> >
> > This is not true. I have a global reserve which is set-up a priori. You
> > cannot allocate a reserve when under pressure, that does not make sense.
>
> I probably did not cut enough details - my main position is to allocate
> per socket reserve from socket's queue, and copy data there from main
> reserve, all of which are allocated either in advance (global one) or
> per sockoption, so that there would be no fairness issues what to mark
> as special and what to not.
>
> Say we have a page per socket, each socket can assign a reserve for
> itself from own memory, this accounts both tx and rx side. Tx is not
> interesting, it is simple, rx has global reserve (always allocated on
> startup or sometime way before reclaim/oom) where data is originally
> received (including skb, shared info and whatever is needed, page is
> just an example), then it is copied into per-socket reserve and reused
> for the next packet. Having per-socket reserve allows to have progress
> in any situation not only in cases where single action must be
> received/processed, and allows to be completely fair for all users, but
> not only special sockets, thus admin for example would be allowed to
> login, ipsec would work and so on...


Ah, I think I understand now. Yes this is indeed a good idea!

It would be quite doable to implement this on top of what I already
have. We would need to extend the socket with a sock_opt that would
reserve a specified amount of data for that specific socket. And then on
socket demux check whether the socket has a non-zero reserve and has not
yet exceeded said reserve. If so, process the packet.

This would also work quite neatly for -rt, where we would not want
incoming packet processing to be delayed by memory allocations.
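
From user space the knob could be as simple as this (SO_MEM_RESERVE is
an invented name and value, purely for illustration - no such option
exists today):

#include <sys/socket.h>

#define SO_MEM_RESERVE 60	/* invented option number, illustration only */

/* Ask the kernel to set aside 'bytes' of reserve for this socket. */
static int set_socket_reserve(int fd, unsigned int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_MEM_RESERVE,
			  &bytes, sizeof(bytes));
}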

2007-08-03 19:42:33

by Daniel Phillips

Subject: Re: Distributed storage.

On Friday 03 August 2007 06:49, Evgeniy Polyakov wrote:
> ...rx has global reserve (always allocated on
> startup or sometime way before reclaim/oom) where data is originally
> received (including skb, shared info and whatever is needed, page is
> just an example), then it is copied into per-socket reserve and
> reused for the next packet. Having per-socket reserve allows to have
> progress in any situation not only in cases where single action must
> be received/processed, and allows to be completely fair for all
> users, but not only special sockets, thus admin for example would be
> allowed to login, ipsec would work and so on...

And when the global reserve is entirely used up your system goes back to
dropping vm writeout acknowledgements, not so good. I like your
approach, and specifically the copying idea cuts out considerable
complexity. But I believe the per-socket flag to mark a socket as part
of the vm writeout path is not optional, and in this case it will be a
better world if it is a slightly unfair world in favor of vm writeout
traffic.

Ssh will still work fine even with vm getting priority access to the
pool. During memory crunches, non-vm ssh traffic may get bumped till
after the crunch, but vm writeout is never supposed to hog the whole
machine. If vm writeout hogs your machine long enough to delay an ssh
login then that is a vm bug and should be fixed at that level.

Regards,

Daniel

2007-08-03 19:48:50

by Daniel Phillips

Subject: Re: Distributed storage.

On Friday 03 August 2007 07:53, Peter Zijlstra wrote:
> On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote:
> > On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra wrote:
> > ...my main position is to
> > allocate per socket reserve from socket's queue, and copy data
> > there from main reserve, all of which are allocated either in
> > advance (global one) or per sockoption, so that there would be no
> > fairness issues what to mark as special and what to not.
> >
> > Say we have a page per socket, each socket can assign a reserve for
> > itself from own memory, this accounts both tx and rx side. Tx is
> > not interesting, it is simple, rx has global reserve (always
> > allocated on startup or sometime way before reclaim/oom) where data
> > is originally received (including skb, shared info and whatever is
> > needed, page is just an example), then it is copied into per-socket
> > reserve and reused for the next packet. Having per-socket reserve
> > allows to have progress in any situation not only in cases where
> > single action must be received/processed, and allows to be
> > completely fair for all users, but not only special sockets, thus
> > admin for example would be allowed to login, ipsec would work and
> > so on...
>
> Ah, I think I understand now. Yes this is indeed a good idea!
>
> It would be quite doable to implement this on top of what I already
> have. We would need to extend the socket with a sock_opt that would
> reserve a specified amount of data for that specific socket. And then
> on socket demux check if the socket has a non zero reserve and has
> not yet exceeded said reserve. If so, process the packet.
>
> This would also quite neatly work for -rt where we would not want
> incoming packet processing to be delayed by memory allocations.

At this point we need "anything that works" in mainline as a starting
point. By erring on the side of simplicity we can make this
understandable for folks who haven't spent the last two years wallowing
in it. The page per socket approach is about as simple as it gets. I
therefore propose we save our premature optimizations for later.

It will also help our cause if we keep any new internal APIs to strictly
what is needed to make deadlock go away. Not a whole lot more than
just the flag to mark a socket as part of the vm writeout path when you
get right down to essentials.

Regards,

Daniel

2007-08-04 01:19:35

by Daniel Phillips

Subject: Re: Distributed storage.

On Friday 03 August 2007 03:26, Evgeniy Polyakov wrote:
> On Thu, Aug 02, 2007 at 02:08:24PM -0700, I wrote:
> > I see bits that worry me, e.g.:
> >
> > + req = mempool_alloc(st->w->req_pool, GFP_NOIO);
> >
> > which seems to be callable in response to a local request, just the
> > case where NBD deadlocks. Your mempool strategy can work reliably
> > only if you can prove that the pool allocations of the maximum
> > number of requests you can have in flight do not exceed the size of
> > the pool. In other words, if you ever take the pool's fallback
> > path to normal allocation, you risk deadlock.
>
> mempool should be allocated to be able to catch up with maximum
> in-flight requests, in my tests I was unable to force block layer to
> put more than 31 pages in sync, but in one bio. Each request is
> essentially delayed bio processing, so this must handle maximum
> number of in-flight bios (if they do not cover multiple nodes, if
> they do, then each node requires own request).

It depends on the characteristics of the physical and virtual block
devices involved. Slow block devices can produce surprising effects.
Ddsnap still qualifies as "slow" under certain circumstances (big
linear write immediately following a new snapshot). Before we added
throttling we would see as many as 800,000 bios in flight. Nice to
know the system can actually survive this... mostly. But memory
deadlock is a clear and present danger under those conditions and we
did hit it (not to mention that read latency sucked beyond belief).

Anyway, we added a simple counting semaphore to throttle the bio traffic
to a reasonable number and behavior became much nicer, but most
importantly, this satisfies one of the primary requirements for
avoiding block device memory deadlock: a strictly bounded amount of bio
traffic in flight. In fact, we allow some bounded number of
non-memalloc bios *plus* however much traffic the mm wants to throw at
us in memalloc mode, on the assumption that the mm knows what it is
doing and imposes its own bound of in flight bios per device. This
needs auditing obviously, but the mm either does that or is buggy. In
practice, with this throttling in place we never saw more than 2,000 in
flight no matter how hard we hit it, which is about the number we were
aiming at. Since we draw our reserve from the main memalloc pool, we
can easily handle 2,000 bios in flight, even under extreme conditions.

See:
http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
down(&info->throttle_sem);
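
In outline the whole mechanism is just this (names simplified from the
ddsnap code linked above; MAX_INFLIGHT stands for whatever bound you
pick):

#include <linux/bio.h>
#include <asm/semaphore.h>

struct devinfo {
	struct semaphore throttle_sem;	/* sema_init(..., MAX_INFLIGHT) */
	/* ... */
};

static void queue_to_server(struct devinfo *info, struct bio *bio)
{
	down(&info->throttle_sem);	/* blocks once MAX_INFLIGHT is reached */
	/* ... hand the request to the user-space server ... */
}

static void server_reply_done(struct devinfo *info)
{
	up(&info->throttle_sem);	/* completion frees one slot */
}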

To be sure, I am not very proud of this throttling mechanism for various
reasons, but the thing is, _any_ throttling mechanism no matter how
sucky solves the deadlock problem. Over time I want to move the
throttling up into bio submission proper, or perhaps incorporate it in
device mapper's queue function, not quite as high up the food chain.
Only some stupid little logistical issues stopped me from doing it one
of those ways right from the start. I think Peter has also tried some
things in this area. Anyway, that part is not pressing because the
throttling can be done in the virtual device itself as we do it, even
if it is not very pretty there. The point is: you have to throttle the
bio traffic. The alternative is to die a horrible death under
conditions that may be rare, but _will_ hit somebody.

Regards,

Daniel

2007-08-04 16:41:02

by Evgeniy Polyakov

Subject: Re: Distributed storage.

On Fri, Aug 03, 2007 at 06:19:16PM -0700, Daniel Phillips ([email protected]) wrote:
> It depends on the characteristics of the physical and virtual block
> devices involved. Slow block devices can produce surprising effects.
> Ddsnap still qualifies as "slow" under certain circumstances (big
> linear write immediately following a new snapshot). Before we added
> throttling we would see as many as 800,000 bios in flight. Nice to

Mmm, sounds tasty to work with such a system :)

> know the system can actually survive this... mostly. But memory
> deadlock is a clear and present danger under those conditions and we
> did hit it (not to mention that read latency sucked beyond belief).
>
> Anyway, we added a simple counting semaphore to throttle the bio traffic
> to a reasonable number and behavior became much nicer, but most
> importantly, this satisfies one of the primary requirements for
> avoiding block device memory deadlock: a strictly bounded amount of bio
> traffic in flight. In fact, we allow some bounded number of
> non-memalloc bios *plus* however much traffic the mm wants to throw at
> us in memalloc mode, on the assumption that the mm knows what it is
> doing and imposes its own bound of in flight bios per device. This
> needs auditing obviously, but the mm either does that or is buggy. In
> practice, with this throttling in place we never saw more than 2,000 in
> flight no matter how hard we hit it, which is about the number we were
> aiming at. Since we draw our reserve from the main memalloc pool, we
> can easily handle 2,000 bios in flight, even under extreme conditions.
>
> See:
> http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
> down(&info->throttle_sem);
>
> To be sure, I am not very proud of this throttling mechanism for various
> reasons, but the thing is, _any_ throttling mechanism no matter how
> sucky solves the deadlock problem. Over time I want to move the

make_request_fn is always called in process context, so we can wait in
it for memory in the mempool. Although that means we are already in
trouble.

I agree, any kind of high-boundary leveling must be implemented in the
device itself, since the block layer does not know what device is at the
end and what it will need to process a given block request.

--
Evgeniy Polyakov

2007-08-05 08:04:54

by Daniel Phillips

Subject: Re: Distributed storage.

On Saturday 04 August 2007 09:37, Evgeniy Polyakov wrote:
> On Fri, Aug 03, 2007 at 06:19:16PM -0700, I wrote:
> > To be sure, I am not very proud of this throttling mechanism for
> > various reasons, but the thing is, _any_ throttling mechanism no
> > matter how sucky solves the deadlock problem. Over time I want to
> > move the
>
> make_request_fn is always called in process context,

Yes, as is submit_bio which calls it. The decision re where it is best
to throttle, in submit_bio or in make_request_fn, has more to do with
system factoring, that is, is throttling something that _every_ block
device should have (yes I think) or is it a delicate, optional thing
that needs a tweakable algorithm per block device type (no I think).

The big worry I had was that by blocking on congestion in the
submit_bio/make_request_fn I might stuff up system-wide mm writeout.
But a while ago that part of the mm was tweaked (by Andrew if I recall
correctly) to use a pool of writeout threads and understand the concept
of one of them blocking on some block device, and not submit more
writeout to the same block device until the first thread finishes its
submission. Meanwhile, other mm writeout threads carry on with other
block devices.

> we can wait in it for memory in mempool. Although that means we
> are already in trouble.

Not at all. This whole block writeout path needs to be written to run
efficiently even when normal system memory is completely gone. All it
means when we wait on a mempool is that the block device queue is as
full as we are ever going to let it become, and that means the block
device is working as hard as it can (subject to a small caveat: for
some loads a device can work more efficiently if it can queue up larger
numbers of requests down at the physical elevators).

By the way, ddsnap waits on a counting semaphore, not a mempool. That
is because we draw our reserve memory from the global memalloc reserve,
not from a mempool. And that is not only because it takes less code to
do so, but mainly because global pools as opposed to lots of little
special purpose pools seem like a good idea to me. Though I will admit
that with our current scheme we need to allow for the total of the
maximum reserve requirements for all memalloc users in the memalloc
pool, so it does not actually save any memory vs dedicated pools. We
could improve that if we wanted to, by having hard and soft reserve
requirements: the global reserve actually only needs to be as big as
the total of the hard requirements. With this idea, if by some unlucky
accident every single pool user got itself maxed out at the same time,
we would still not exceed our share of the global reserve.
Under "normal" low memory situations, a block device would typically be
free to grab reserve memory up to its soft limit, allowing it to
optimize over a wider range of queued transactions. My little idea
here is: allocating specific pages to a pool is kind of dumb, all we
really want to do is account precisely for the number of pages we are
allowed to draw from the global reserve.
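
In code, "account pages instead of allocating them" need not be more
than this (an invented illustration, not from any posted patch):

#include <asm/atomic.h>
#include <linux/types.h>

struct reserve_account {
	atomic_t	charged;	/* pages currently drawn */
	unsigned int	hard;		/* guaranteed share of the global reserve */
	unsigned int	soft;		/* opportunistic extra when memory is easy */
};

static bool reserve_charge(struct reserve_account *acct,
			   unsigned int pages, bool memory_tight)
{
	unsigned int limit = memory_tight ? acct->hard : acct->soft;

	if (atomic_add_return(pages, &acct->charged) > (int)limit) {
		atomic_sub(pages, &acct->charged);
		return false;		/* caller throttles or waits */
	}
	return true;
}

static void reserve_uncharge(struct reserve_account *acct, unsigned int pages)
{
	atomic_sub(pages, &acct->charged);
}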

OK, I kind of digressed, but this all counts as explaining the details
of what Peter and I have been up to for the last year (longer for me).
At this point, we don't need to do the reserve accounting in the most
absolutely perfect way possible, we just need to get something minimal
in place to fix the current deadlock problems, then we can iteratively
improve it.

> I agree, any kind of high-boundary leveling must be implemented in
> device itself, since block layer does not know what device is at the
> end and what it will need to process given block request.

I did not say the throttling has to be implemented in the device, only
that we did it there because it was easiest to code that up and try it
out (it worked). This throttling really wants to live at a higher
level, possibly submit_bio()...bio->endio(). Someone at OLS (James
Bottomley?) suggested it would be better done at the request queue
layer, but I do not immediately see why that should be. I guess this
is going to come down to somebody throwing out a patch for interested
folks to poke at. But this detail is a fine point. The big point is
to have _some_ throttling mechanism in place on the block IO path,
always.

Device mapper in particular does not have any throttling itself: calling
submit_bio on a device mapper device directly calls the device mapper
bio dispatcher. Default initialized block device queues do provide a
crude form of throttling based on limiting the number of requests.
This is insufficiently precise to do a good job in the long run, but it
works for now because the current gaggle of low level block drivers do
not have a lot of resource requirements and tend to behave fairly
predictably (except for some irritating issues re very slow devices
working in parallel with very fast devices, but... worry about that
later). Network block drivers - for example your driver - do have
nontrivial resource requirements and they do not, as far as I can see,
have any form of throttling on the upstream side. So unless you can
demonstrate I'm wrong (I would be more than happy about that) then we
are going to need to add some.

Anyway, I digressed again. _Every_ layer in a block IO stack needs to
have a reserve, if it consumes memory. So we can't escape the question
of how big to make those reserves by trying to push it all down to the
lowest level, hoping that the low level device knows more about how
many requests it will have in flight. For the time being, we will just
plug in some seat of the pants numbers in classic Linux fashion and
that will serve us until somebody gets around to inventing the one true
path discovery mechanism that can sniff around in the block IO stack
and figure out the optimal amount of system resources to reserve at
each level, which ought to be worth at least a master's thesis for
somebody.

Regards,

Daniel

2007-08-05 15:09:21

by Evgeniy Polyakov

Subject: Re: Distributed storage.

Hi Daniel.

On Sun, Aug 05, 2007 at 01:04:19AM -0700, Daniel Phillips ([email protected]) wrote:
> > we can wait in it for memory in mempool. Although that means we
> > are already in trouble.
>
> Not at all. This whole block writeout path needs to be written to run
> efficiently even when normal system memory is completely gone. All it
> means when we wait on a mempool is that the block device queue is as
> full as we are ever going to let it become, and that means the block
> device is working as hard as it can (subject to a small caveat: for
> some loads a device can work more efficiently if it can queue up larger
> numbers of requests down at the physical elevators).

If we are sleeping in the memory pool, then we already do not have
memory to complete previous requests, so we are in trouble. This can
work for devices which do not require additional allocations (like
ordinary local storage), but not for network-connected ones.

> > I agree, any kind of high-boundary leveling must be implemented in
> > device itself, since block layer does not know what device is at the
> > end and what it will need to process given block request.
>
> I did not say the throttling has to be implemented in the device, only
> that we did it there because it was easiest to code that up and try it
> out (it worked). This throttling really wants to live at a higher
> level, possibly submit_bio()...bio->endio(). Someone at OLS (James
> Bottomley?) suggested it would be better done at the request queue
> layer, but I do not immediately see why that should be. I guess this
> is going to come down to somebody throwing out a patch for interested
> folks to poke at. But this detail is a fine point. The big point is
> to have _some_ throttling mechanism in place on the block IO path,
> always.

If not in the device, then at least it should tell the block layer about
its limits. What about a new function to register the queue, which takes
the maximum number of bios in flight, with generic_make_request()
sleeping when a new bio is about to be submitted and would exceed the
limit?

By default things would be as they are now, except for an additional
non-atomic increment and branch in generic_make_request() and a
decrement and wake in bio_end_io()?

I can cook up such a patch if the idea is worth the effort.

--
Evgeniy Polyakov

2007-08-05 21:24:04

by Daniel Phillips

Subject: Re: Distributed storage.

On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote:
> If we are sleeping in memory pool, then we already do not have memory
> to complete previous requests, so we are in trouble.

Not at all. Any requests in flight are guaranteed to get the resources
they need to complete. This is guaranteed by the combination of memory
reserve management and request queue throttling. In logical terms,
reserve management plus queue throttling is necessary and sufficient to
prevent these deadlocks. Conversely, the absence of either one allows
deadlock.

> This can work
> for devices which do not require additional allocations (like usual
> local storage), but not for network connected ones.

It works for network devices too, and also for a fancy device like
ddsnap, which is the moral equivalent of a filesystem implemented in
user space.

> If not in device, then at least it should say to block layer about
> its limits. What about new function to register queue...

Yes, a new internal API is needed eventually. However, no new api is
needed right at the moment because we can just hard code the reserve
sizes and queue limits and audit them by hand, which is not any more
sloppy than several other kernel subsystems. The thing is, we need to
keep any obfuscating detail out of the initial patches because these
principles are hard enough to explain already without burying them in
hundreds of lines of API fluff.

That said, the new improved API should probably not be a new way to
register, but a set of function calls you can use after the queue is
created, which follows the pattern of the existing queue API.

> ...which will get
> maximum number of bios in flight and sleep in generic_make_request()
> when new bio is going to be submitted and it is about to exceed the
> limit?

Exactly. This is what ddsnap currently does and it works. But we did
not change generic_make_request for this driver, instead we throttled
the driver from the time it makes a request to its user space server,
until the reply comes back. We did it that way because it was easy and
was the only segment of the request lifeline that could not be fixed by
other means. A proper solution for all block devices will move the
throttling up into generic_make_request, as you say below.

> By default things will be like they are now, except additional
> non-atomic increment and branch in generic_make_request() and
> decrement and wake in bio_end_io()?

->endio is called in interrupt context, so the accounting needs to be
atomic as far as I can see.

We actually account the total number of bio pages in flight, otherwise
you would need to assume the largest possible bio and waste a huge
amount of reserve memory. A counting semaphore works fine for this
purpose, with some slight inefficiency that is nigh on unmeasurable in
the block IO path. What the semaphore does is make the patch small and
easy to understand, which is important at this point.
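
Accounting pages with the semaphore is as simple as charging one count
per biovec page (a sketch; the budget is set once with sema_init()):

#include <linux/bio.h>
#include <asm/semaphore.h>

static void throttle_bio_pages(struct semaphore *sem, struct bio *bio)
{
	int i;

	for (i = 0; i < bio->bi_vcnt; i++)
		down(sem);		/* may sleep until enough pages retire */
}

static void release_bio_pages(struct semaphore *sem, struct bio *bio)
{
	int i;

	for (i = 0; i < bio->bi_vcnt; i++)
		up(sem);		/* safe from ->bi_end_io (irq context) */
}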

> I can cook up such a patch if the idea is worth the effort.

It is. There are some messy details... You need a place to store the
accounting variable/semaphore and need to be able to find that place
again in ->endio. Trickier than it sounds, because of the unstructured
way drivers rewrite ->bi_bdev. Peterz has already poked at this in a
number of different ways, typically involving backing_dev_info, which
seems like a good idea to me.

A simple way to solve the stable accounting field issue is to add a new
pointer to struct bio that is owned by the top level submitter
(normally generic_make_request but not always) and is not affected by
any recursive resubmission. Then getting rid of that field later
becomes somebody's summer project, which is not all that urgent because
struct bio is already bloated up with a bunch of dubious fields and is
a transient structure anyway.

Regards,

Daniel

2007-08-06 08:28:05

by Evgeniy Polyakov

Subject: Re: Distributed storage.

On Sun, Aug 05, 2007 at 02:23:45PM -0700, Daniel Phillips ([email protected]) wrote:
> On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote:
> > If we are sleeping in memory pool, then we already do not have memory
> > to complete previous requests, so we are in trouble.
>
> Not at all. Any requests in flight are guaranteed to get the resources
> they need to complete. This is guaranteed by the combination of memory
> reserve management and request queue throttling. In logical terms,
> reserve management plus queue throttling is necessary and sufficient to
> prevent these deadlocks. Conversely, the absence of either one allows
> deadlock.

Only if you have both, and they must be closely related to each other
(i.e. each request must have a network reserve big enough to store its
data).

> > This can work
> > for devices which do not require additional allocations (like usual
> > local storage), but not for network connected ones.
>
> It works for network devices too, and also for a fancy device like
> ddsnap, which is the moral equivalent of a filesystem implemented in
> user space.

With or without the vm deadlock patches? I cannot see how it can work
if the network does not have a reserve and free memory is completely
gone. If all systems have a reserve then yes, it works well.

> > By default things will be like they are now, except additional
> > non-atomic increment and branch in generic_make_request() and
> > decrement and wake in bio_end_io()?
>
> ->endio is called in interrupt context, so the accounting needs to be
> atomic as far as I can see.

Actually we only care about whether there is room in the queue or not -
so it can be a flag. And non-atomic operations are OK, since being off
by a couple of requests in flight does not change the picture, but it
avoids introducing slow atomic operations in the fast path.

> We actually account the total number of bio pages in flight, otherwise
> you would need to assume the largest possible bio and waste a huge
> amount of reserve memory. A counting semaphore works fine for this
> purpose, with some slight inefficiency that is nigh on unmeasurable in
> the block IO path. What the semaphore does is make the patch small and
> easy to understand, which is important at this point.

Yes, it can be bio vectors.

> > I can cook up such a patch if idea worth efforts.
>
> It is. There are some messy details... You need a place to store the
> accounting variable/semaphore and need to be able to find that place
> again in ->endio. Trickier than it sounds, because of the unstructured
> way drivers rewrite ->bi_bdev. Peterz has already poked at this in a
> number of different ways, typically involving backing_dev_info, which
> seems like a good idea to me.

We can demand that the reserve is not per virtual device, but per real
one - for example, in the case of distributed storage a locally
connected node should have a much higher limit than a network one, but
a per-virtual-device reserve might end up in a situation where the
local node could process data, yet no requests get queued since all the
requests below the limit are on the network node. In the case of a
per-real-device limit there is no need to extend struct bio.

--
Evgeniy Polyakov

2007-08-07 12:05:36

by Jens Axboe

Subject: Re: Distributed storage.

On Sun, Aug 05 2007, Daniel Phillips wrote:
> A simple way to solve the stable accounting field issue is to add a new
> pointer to struct bio that is owned by the top level submitter
> (normally generic_make_request but not always) and is not affected by
> any recursive resubmission. Then getting rid of that field later
> becomes somebody's summer project, which is not all that urgent because
> struct bio is already bloated up with a bunch of dubious fields and is
> a transient structure anyway.

Thanks for your insights. Care to detail what bloat and dubious fields
struct bio has?

And we don't add temporary fields out of laziness, hoping that "someone"
will later kill it again and rewrite it in a nicer fashion. Hint: that
never happens, bloat sticks.

--
Jens Axboe

2007-08-07 18:25:29

by Daniel Phillips

Subject: Re: Distributed storage.

On Tuesday 07 August 2007 05:05, Jens Axboe wrote:
> On Sun, Aug 05 2007, Daniel Phillips wrote:
> > A simple way to solve the stable accounting field issue is to add a
> > new pointer to struct bio that is owned by the top level submitter
> > (normally generic_make_request but not always) and is not affected
> > by any recursive resubmission. Then getting rid of that field
> > later becomes somebody's summer project, which is not all that
> > urgent because struct bio is already bloated up with a bunch of
> > dubious fields and is a transient structure anyway.
>
> Thanks for your insights. Care to detail what bloat and dubious
> fields struct bio has?

First obvious one I see is bi_rw separate from bi_flags. Front_size and
back_size smell dubious. Is max_vecs really necessary? You could
reasonably assume bi_vcnt rounded up to a power of two and bury the
details of making that work behind wrapper functions to change the
number of bvecs, if anybody actually needs that. Bi_endio and
bi_destructor could be combined. I don't see a lot of users of bi_idx,
that looks like a soft target. See what happened to struct page when a
couple of folks got serious about attacking it, some really deep hacks
were done to pare off a few bytes here and there. But struct bio as a
space waster is not nearly in the same ballpark.

It would be interesting to see if bi_bdev could be made read only.
Generally, each stage in the block device stack knows what the next
stage is going to be, so why do we have to write that in the bio? For
error reporting from interrupt context? Anyway, if Evgeniy wants to do
the patch, I will happily unload the task of convincing you that random
fields are/are not needed in struct bio :-)

Regards,

Daniel

2007-08-07 21:19:47

by Jens Axboe

Subject: Re: Distributed storage.

On Tue, Aug 07 2007, Daniel Phillips wrote:
> On Tuesday 07 August 2007 05:05, Jens Axboe wrote:
> > On Sun, Aug 05 2007, Daniel Phillips wrote:
> > > A simple way to solve the stable accounting field issue is to add a
> > > new pointer to struct bio that is owned by the top level submitter
> > > (normally generic_make_request but not always) and is not affected
> > > by any recursive resubmission. Then getting rid of that field
> > > later becomes somebody's summer project, which is not all that
> > > urgent because struct bio is already bloated up with a bunch of
> > > dubious fields and is a transient structure anyway.
> >
> > Thanks for your insights. Care to detail what bloat and dubious
> > fields struct bio has?
>
> First obvious one I see is bi_rw separate from bi_flags. Front_size and
> back_size smell dubious. Is max_vecs really necessary? You could

I don't like structure bloat, but I do like nice design. Overloading is
a necessary evil sometimes, though. Even today, there isn't enough room
to hold bi_rw and bi_flags in the same variable on 32-bit archs, so that
concern can be scratched. If you read bio.h, that much is obvious.

If you check up on the iommu virtual merging, you'll understand the
front and back size members. They may smell dubious to you, but please
take the time to understand why it looks the way it does.

> reasonably assume bi_vcnt rounded up to a power of two and bury the
> details of making that work behind wrapper functions to change the
> number of bvecs, if anybody actually needs that. Bi_endio and

Changing the number of bvecs is integral to how bio buildup currently
works.

> bi_destructor could be combined. I don't see a lot of users of bi_idx,

bi_idx is integral to partial io completions.

> that looks like a soft target. See what happened to struct page when a
> couple of folks got serious about attacking it, some really deep hacks
> were done to pare off a few bytes here and there. But struct bio as a
> space waster is not nearly in the same ballpark.

So show some concrete patches and examples; hand waving and assumptions
are just a waste of everyone's time.

> It would be interesting to see if bi_bdev could be made read only.
> Generally, each stage in the block device stack knows what the next
> stage is going to be, so why do we have to write that in the bio? For
> error reporting from interrupt context? Anyway, if Evgeniy wants to do
> the patch, I will happily unload the task of convincing you that random
> fields are/are not needed in struct bio :-)

It's a trade off, otherwise you'd have to pass the block device around a
lot. And it's, again, a design issue. A bio contains destination
information, that means device/offset/size information. I'm all for
shaving structure bytes where it matters, but not for the sake of
sacrificing code stability or design. I consider struct bio quite lean
and have worked hard to keep it that way. In fact, iirc, the only
addition to struct bio since 2001 is the iommu front/back size members.
And I resisted those for quite a while.

--
Jens Axboe

2007-08-08 09:56:04

by Evgeniy Polyakov

Subject: Block device throttling [Re: Distributed storage.]

On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe ([email protected]) wrote:
> I don't like structure bloat, but I do like nice design. Overloading is

So, what did we decide? To bloat bio a bit (add a queue pointer) or to
use physical device limits? The latter requires replacing every
occurrence of bio->bi_bdev = something_new with blk_set_bdev(bio,
something_new), where queue limits will be appropriately charged. So far
I'm testing the second case, but I have only changed DST for testing; I
can change all the other users if needed though.

--
Evgeniy Polyakov

2007-08-08 10:17:49

by Evgeniy Polyakov

Subject: [1/1] Block device throttling [Re: Distributed storage.]

This throttling mechanism allows limiting the maximum number of queued
bios per physical device. By default it is turned off and the old block
layer behaviour with an unlimited number of bios is used. When turned on
(the queue limit is set to something other than -1U via
blk_queue_set_limit()), generic_make_request() will sleep until there is
room in the queue. The number of bios is increased in
generic_make_request() and reduced in bio_endio() when the bio is
completely processed (bi_size is zero), or recharged from the original
queue when a new device is assigned to the bio via blk_set_bdev(). None
of the operations are atomic, since we do not care about the precise
number of bios, only about the fact that we are close enough to the
limit.

Tested on the distributed storage device - with a limit of 2 bios it
works slowly :)

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index c99b463..1882c9b 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1851,6 +1851,10 @@ request_queue_t *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q->backing_dev_info.unplug_io_data = q;

+ q->bio_limit = -1U;
+ q->bio_queued = 0;
+ init_waitqueue_head(&q->wait);
+
mutex_init(&q->sysfs_lock);

return q;
@@ -3237,6 +3241,16 @@ end_io:
*/
void generic_make_request(struct bio *bio)
{
+ request_queue_t *q;
+
+ BUG_ON(!bio->bi_bdev);
+
+ q = bdev_get_queue(bio->bi_bdev);
+ if (q && q->bio_limit != -1U) {
+ wait_event_interruptible(q->wait, q->bio_queued + 1 <= q->bio_limit);
+ q->bio_queued++;
+ }
+
if (current->bio_tail) {
/* make_request is active */
*(current->bio_tail) = bio;
diff --git a/fs/bio.c b/fs/bio.c
index 093345f..0a33958 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1028,6 +1028,16 @@ void bio_endio(struct bio *bio, unsigned int bytes_done, int error)
bio->bi_size -= bytes_done;
bio->bi_sector += (bytes_done >> 9);

+ if (!bio->bi_size && bio->bi_bdev) {
+ request_queue_t *q;
+
+ q = bdev_get_queue(bio->bi_bdev);
+ if (q) {
+ q->bio_queued--;
+ wake_up(&q->wait);
+ }
+ }
+
if (bio->bi_end_io)
bio->bi_end_io(bio, bytes_done, error);
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db5b00a..7ce0cd7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -467,6 +467,9 @@ struct request_queue
struct request *orig_bar_rq;
unsigned int bi_size;

+ wait_queue_head_t wait;
+ unsigned int bio_limit, bio_queued;
+
struct mutex sysfs_lock;
};

@@ -764,6 +767,30 @@ extern long nr_blockdev_pages(void);
int blk_get_queue(request_queue_t *);
request_queue_t *blk_alloc_queue(gfp_t);
request_queue_t *blk_alloc_queue_node(gfp_t, int);
+
+static inline void blk_queue_set_limit(request_queue_t *q, unsigned int limit)
+{
+ q->bio_limit = limit;
+}
+
+static inline void blk_set_bdev(struct bio *bio, struct block_device *bdev)
+{
+ request_queue_t *q;
+
+ if (!bio->bi_bdev) {
+ bio->bi_bdev = bdev;
+ return;
+ }
+
+ q = bdev_get_queue(bio->bi_bdev);
+ if (q) {
+ q->bio_queued--;
+ wake_up(&q->wait);
+ }
+
+ bio->bi_bdev = bdev;
+}
+
extern void blk_put_queue(request_queue_t *);

/*

--
Evgeniy Polyakov

2007-08-08 13:30:16

by Evgeniy Polyakov

Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Wed, Aug 08, 2007 at 02:17:09PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> This throttling mechanism allows to limit maximum amount of queued bios
> per physical device. By default it is turned off and old block layer
> behaviour with unlimited number of bios is used. When turned on (queue
> limit is set to something other than -1U via blk_queue_set_limit()),
> generic_make_request() will sleep until there is room in the queue.
> number of bios is increased in generic_make_request() and reduced either
> in bio_endio(), when bio is completely processed (bi_size is zero), and
> recharged from original queue when new device is assigned to bio via
> blk_set_bdev(). None of the operations are atomic, since we do not care about
> precise number of bios, but a fact, that we are close or close enough to
> the limit.
>
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

As an addon I can cook up a patch to configure this via sysfs if needed.
Thoughts?
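
For reference, a driver using the patch from the previous mail would do
roughly this (the limit of 128 is an arbitrary example):

#include <linux/blkdev.h>
#include <linux/bio.h>

static void example_enable_throttle(request_queue_t *q)
{
	/* Cap this queue at 128 in-flight bios; -1U (the default) keeps
	 * the old unlimited behaviour. */
	blk_queue_set_limit(q, 128);
}

static void example_remap(struct bio *bio, struct block_device *target)
{
	/* Stacking drivers hand the bio on via blk_set_bdev() instead of
	 * assigning bio->bi_bdev directly, so the old queue's count is
	 * released and the new queue gets charged on resubmission. */
	blk_set_bdev(bio, target);
}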

--
Evgeniy Polyakov

2007-08-12 23:16:37

by Daniel Phillips

Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

Hi Evgeniy,

Sorry for not getting back to you right away, I was on the road with
limited email access. Incidentally, the reason my mails to you keep
bouncing is, your MTA is picky about my mailer's IP reversing to a real
hostname. I will take care of that pretty soon, but for now my direct
mail to you is going to bounce and you will only see the lkml copy.

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> This throttling mechanism allows to limit maximum amount of queued
> bios per physical device. By default it is turned off and old block
> layer behaviour with unlimited number of bios is used. When turned on
> (queue limit is set to something different than -1U via
> blk_queue_set_limit()), generic_make_request() will sleep until there
> is room in the queue. number of bios is increased in
> generic_make_request() and reduced either in bio_endio(), when bio is
> completely processed (bi_size is zero), and recharged from original
> queue when new device is assigned to bio via blk_set_bdev(). All
> operations are non-atomic, since we do not care about precise number
> of bios, but a fact, that we are close or close enough to the limit.
>
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

it seems to me you need:

- if (q) {
+ if (q && q->bio_limit != -1) {

This patch is short and simple, and will throttle more accurately than
the current simplistic per-request allocation limit. However, it fails
to throttle device mapper devices. This is because no request is
allocated by the device mapper queue method, instead the mapping call
goes straight through to the mapping function. If the mapping function
allocates memory (typically the case) then this resource usage evades
throttling and deadlock becomes a risk.

There are three obvious fixes:

1) Implement bio throttling in each virtual block device
2) Implement bio throttling generically in device mapper
3) Implement bio throttling for all block devices

Number 1 is the approach we currently use in ddsnap, but it is ugly and
repetitious. Number 2 is a possibility, but I favor number 3 because
it is a system-wide solution to a system-wide problem, does not need to
be repeated for every block device that lacks a queue, heads in the
direction of code subtraction, and allows system-wide reserve
accounting.

Your patch is close to the truth, but it needs to throttle at the top
(virtual) end of each block device stack instead of the bottom
(physical) end. It does indeed head in the direction of eliminating
your own deadlock risk; however, there are block devices it does not
cover.

Regards,

Daniel

2007-08-12 23:37:23

by Daniel Phillips

Subject: Re: Distributed storage.

On Tuesday 07 August 2007 13:55, Jens Axboe wrote:
> I don't like structure bloat, but I do like nice design. Overloading
> is a necessary evil sometimes, though. Even today, there isn't enough
> room to hold bi_rw and bi_flags in the same variable on 32-bit archs,
> so that concern can be scratched. If you read bio.h, that much is
> obvious.

Sixteen bits in bi_rw are consumed by queue priority. Is there a reason
this lives in struct bio instead of struct request?

> If you check up on the iommu virtual merging, you'll understand the
> front and back size members. They may smell dubious to you, but
> please take the time to understand why it looks the way it does.

Virtual merging is only needed at the physical device, so why do these
fields live in struct bio instead of struct request?

> Changing the number of bvecs is integral to how bio buildup current
> works.

Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can
derive efficiently from BIO_POOL_IDX() provided the bio was allocated
in the standard way. This leaves a little bit of clean up to do for
bios not allocated from a standard pool.
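
Something like this is all it would take for pool-allocated bios (a
sketch; the table mirrors the bvec slab sizes in fs/bio.c):

#include <linux/bio.h>

/* Bios from the standard pools record the pool index in bi_flags, so
 * the maximum vec count can be looked up instead of stored in the bio. */
static const unsigned int bvec_pool_size[] = { 1, 4, 16, 64, 128, 256 };

static inline unsigned int bio_derived_max_vecs(struct bio *bio)
{
	return bvec_pool_size[BIO_POOL_IDX(bio)];
}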

Incidentally, why does the bvl need to be memset to zero on allocation?
bi_vcnt already tells you which bvecs are valid and the only field in a
bvec that can reasonably default to zero is the offset, which ought to
be set every time a bvec is initialized anyway.

> > bi_destructor could be combined. I don't see a lot of users of
> > bi_idx,
>
> bi_idx is integral to partial io completions.

Struct request has a remaining submission sector count so what does
bi_idx do that is different?

> > that looks like a soft target. See what happened to struct page
> > when a couple of folks got serious about attacking it, some really
> > deep hacks were done to pare off a few bytes here and there. But
> > struct bio as a space waster is not nearly in the same ballpark.
>
> So show some concrete patches and examples, hand waving and
> assumptions is just a waste of everyones time.

Average struct bio memory footprint ranks near the bottom of the list of
things that suck most about Linux storage. At idle I see 8K in use
(reserves); during updatedb it spikes occasionally to 50K; under a
heavy load generated by ddsnap on a storage box it sometimes goes to
100K with bio throttling in place. Really not moving the needle.

On the other hand, vm writeout deadlock ranks smack dab at the top of
the list, so that is where the patching effort must go for the
foreseeable future. Without bio throttling, the ddsnap load can go to
24 MB for struct bio alone. That definitely moves the needle. In
short, we save 3,200 times more memory by putting decent throttling in
place than by saving an int in struct bio.

That said, I did a little analysis to get an idea of where the soft
targets are in struct bio, and to get to know the bio layer a little
better. Maybe these few hints will get somebody interested enough to
look further.

> > It would be interesting to see if bi_bdev could be made read only.
> > Generally, each stage in the block device stack knows what the next
> > stage is going to be, so why do we have to write that in the bio?
> > For error reporting from interrupt context? Anyway, if Evgeniy
> > wants to do the patch, I will happily unload the task of convincing
> > you that random fields are/are not needed in struct bio :-)
>
> It's a trade off, otherwise you'd have to pass the block device
> around a lot.

Which costs very little, probably less than trashing an extra field's
worth of cache.

> And it's, again, a design issue. A bio contains
> destination information, that means device/offset/size information.
> I'm all for shaving structure bytes where it matters, but not for the
> sake of sacrificing code stability or design. I consider struct bio
> quite lean and have worked hard to keep it that way. In fact, iirc,
> the only addition to struct bio since 2001 is the iommu front/back
> size members. And I resisted those for quite a while.

You did not comment on the one about putting the bio destructor in
the ->endio handler, which looks dead simple. The majority of cases
just use the default endio handler and the default destructor. Of the
remaining cases, where a specialized destructor is needed, typically a
specialized endio handler is too, so combining is free. There are few
if any cases where a new specialized endio handler would need to be
written.

As far as code stability goes, current kernels are horribly unstable in
a variety of contexts because of memory deadlock and slowdowns related
to the attempt to fix the problem via dirty memory limits. Accurate
throttling of bio traffic is one of the two key requirements to fix
this instability, the other is accurate writeout path reserve
management, which is only partially addressed by BIO_POOL.

Nice to see you jumping in Jens. Now it is over to the other side of
the thread where Evgeniy has posted a patch that a) grants your wish to
add no new field in struct bio and b) does not fix the problem.

Regards,

Daniel

2007-08-13 05:22:32

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe
([email protected]) wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires to replace all
> occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio,
> something_new), where queue limits will be appropriately charged. So
> far I'm testing second case, but I only changed DST for testing, can
> change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as
in your posted patch both suffer from the same problem: you release the
throttling on the previous queue when the bio moves to a new one, which
is a bug because memory consumption on the previous queue then becomes
unbounded, or limited only by the number of struct requests that can be
allocated. In other words, it reverts to the same situation we have
now as soon as the IO stack has more than one queue. (Just a shorter
version of my previous post.)

We can solve this by having the bio only point at the queue to which it
was originally submitted, since throttling the top level queue
automatically throttles all queues lower down the stack. Alternatively
the bio can point at the block_device or straight at the
backing_dev_info, which is the per-device structure it actually needs
to touch.

Note! There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure. A bio
can carry as few as one page or as many as 256 pages. If you take only
one throttle count per bio and that data will be transferred over the
network then you have to assume that (a little more than) 256 pages of
sk_alloc reserve will be needed for every bio, resulting in a grossly
over-provisioned reserve. The precise reserve calculation we want to
do is per-block device, and you will find hooks like this already
living in backing_dev_info. We need to place our own fn+data there to
calculate the throttle draw for each bio. Unthrottling gets trickier
with variable size throttle draw. In ddsnap, we simply write the
amount we drew from the throttle into (the private data of) bio for use
later by unthrottle, thus avoiding the issue that the bio fields we
used to calculate might have changed during the lifetime of the bio.
This would translate into one more per-bio field.



the throttling performs another function: keeping a reasonable amount of
IO in flight for the device. The definition of "reasonable" is
complex. For a hard disk it depends on the physical distance between
sector addresses of the bios in flight. In ddsnap we make a crude but
workable approximation that


In general, a per block device

The throttle count needs to cover

Regards,

Daniel

2007-08-13 05:36:40

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

(previous incomplete message sent accidentally)

On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires to replace all
> occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio,
> something_new), where queue limits will be appropriately charged. So
> far I'm testing second case, but I only changed DST for testing, can
> change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as
in your posted patch both suffer from the same problem: you release the
throttling on the previous queue when the bio moves to a new one, which
is a bug because memory consumption on the previous queue then becomes
unbounded, or limited only by the number of struct requests that can be
allocated. In other words, it reverts to the same situation we have
now as soon as the IO stack has more than one queue. (Just a shorter
version of my previous post.)

We can solve this by having the bio only point at the queue to which it
was originally submitted, since throttling the top level queue
automatically throttles all queues lower down the stack. Alternatively
the bio can point at the block_device or straight at the
backing_dev_info, which is the per-device structure it actually needs
to touch.
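
For illustration, a minimal sketch of that idea: charge the throttle once,
against the backing_dev_info of the device the bio was first submitted to,
and release it only at final completion. The names bio_limit,
bio_in_flight, throttle_wait and bi_throttle_bdi below are invented for
the sketch; they are not fields that exist today.

static void submit_bio_throttled(int rw, struct bio *bio)
{
        struct backing_dev_info *bdi =
                bio->bi_bdev->bd_inode->i_mapping->backing_dev_info;

        /* block until the top-level device has room for one more bio */
        wait_event(bdi->throttle_wait,
                   atomic_read(&bdi->bio_in_flight) < bdi->bio_limit);
        atomic_inc(&bdi->bio_in_flight);
        bio->bi_throttle_bdi = bdi;        /* remember whom we charged */
        submit_bio(rw, bio);
}

/* called once, from the final completion of the bio */
static void unthrottle_bio(struct bio *bio)
{
        struct backing_dev_info *bdi = bio->bi_throttle_bdi;

        atomic_dec(&bdi->bio_in_flight);
        wake_up(&bdi->throttle_wait);
}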

Note! There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure. A bio
can carry as few as one page or as many as 256 pages. If you take only
one throttle count per bio and that data will be transferred over the
network then you have to assume that (a little more than) 256 pages of
sk_alloc reserve will be needed for every bio, resulting in a grossly
over-provisioned reserve. The precise reserve calculation we want to
do is per-block device, and you will find hooks like this already
living in backing_dev_info. We need to place our own fn+data there to
calculate the throttle draw for each bio. Unthrottling gets trickier
with variable size throttle draw. In ddsnap, we simply write the
amount we drew from the throttle into (the private data of) bio for use
later by unthrottle, thus avoiding the issue that the bio fields we
used to calculate might have changed during the lifetime of the bio.
This would translate into one more per-bio field.
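
To make the shape of that concrete, here is a sketch of such a per-device
hook. The throttle_metric, throttle_count, throttle_limit, throttle_wait
and per-bio throttle_draw names are invented for illustration, and
bio_private() stands in for wherever the owner keeps its per-bio data;
only bi_size and PAGE_SHIFT are real.

/* charge by payload pages instead of one count per bio */
static unsigned example_throttle_metric(struct bio *bio)
{
        return (bio->bi_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

static void throttle_bio(struct backing_dev_info *bdi, struct bio *bio)
{
        unsigned draw = bdi->throttle_metric(bio);  /* per-device fn+data hook */

        wait_event(bdi->throttle_wait,
                   atomic_read(&bdi->throttle_count) + draw <= bdi->throttle_limit);
        atomic_add(draw, &bdi->throttle_count);

        /*
         * Remember how much we drew so that unthrottle does not depend
         * on bio fields that may change while the bio is in flight.
         */
        bio_private(bio)->throttle_draw = draw;
}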

2) Exposing the per-block device throttle limits via sysfs or similar is
really not a good long term solution for system administration.
Imagine our help text: "just keep trying smaller numbers until your
system deadlocks". We really need to figure this out internally and
get it correct. I can see putting in a temporary userspace interface
just for experimentation, to help determine what really is safe, and
what size the numbers should be to approach optimal throughput in a
fully loaded memory state.

Regards,

Daniel

2007-08-13 12:21:20

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Sunday 12 August 2007 22:36, I wrote:
> Note! There are two more issues I forgot to mention earlier.

Oops, and there is also:

3) The bio throttle, which is supposed to prevent deadlock, can itself
deadlock. Let me see if I can remember how it goes.

* generic_make_request puts a bio in flight
* the bio gets past the throttle and initiates network IO
* net calls sk_alloc->alloc_pages->shrink_caches
* shrink_caches submits a bio recursively to our block device
* this bio blocks on the throttle
* net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right, however you
can see the danger. In ddsnap we kludge around this problem by not
throttling any bio submitted in PF_MEMALLOC mode, which effectively
increases our reserve requirement by the amount of IO that mm will
submit to a given block device before deciding the device is congested
and should be left alone. This works, but is sloppy and disgusting.
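
The kludge amounts to something like the sketch below; the throttle
structure and its fields are invented here, but the PF_MEMALLOC test is
the essence of it.

static void maybe_throttle(struct throttle *t)
{
        /*
         * Writeout issued from memory reclaim must not block on our own
         * throttle, otherwise the allocator that is trying to free memory
         * for the network path waits on the very IO it is part of.
         */
        if (current->flags & PF_MEMALLOC)
                return;

        wait_event(t->wait, atomic_read(&t->in_flight) < t->limit);
        atomic_inc(&t->in_flight);
}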

The right thing to do is to make sure that the mm knows about our
throttle accounting in backing_dev_info so it will not push IO to our
device when it knows that the IO will just block on congestion.
Instead, shrink_caches will find some other less congested block device
or give up, causing alloc_pages to draw from the memalloc reserve to
satisfy the sk_alloc request.

The mm already uses backing_dev_info this way, we just need to set the
right bits in the backing_dev_info state flags. I think Peter posted a
patch set that included this feature at some point.
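
Roughly, that would look like the sketch below, assuming the
BDI_*_congested bits in the backing_dev_info state word of current
kernels; the device structure and its counters are invented.

static void example_update_congestion(struct example_dev *dev)
{
        struct backing_dev_info *bdi = &dev->queue->backing_dev_info;

        if (atomic_read(&dev->bio_in_flight) >= dev->bio_limit)
                set_bit(BDI_write_congested, &bdi->state);
        else
                clear_bit(BDI_write_congested, &bdi->state);
        /* the real thing would also wake anyone in congestion_wait() */
}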

Regards,

Daniel

2007-08-13 12:46:18

by Jens Axboe

[permalink] [raw]
Subject: Re: Distributed storage.

On Sun, Aug 12 2007, Daniel Phillips wrote:
> On Tuesday 07 August 2007 13:55, Jens Axboe wrote:
> > I don't like structure bloat, but I do like nice design. Overloading
> > is a necessary evil sometimes, though. Even today, there isn't enough
> > room to hold bi_rw and bi_flags in the same variable on 32-bit archs,
> > so that concern can be scratched. If you read bio.h, that much is
> > obvious.
>
> Sixteen bits in bi_rw are consumed by queue priority. Is there a reason
> this lives in struct bio instead of struct request?

If you don't, you have to pass them down. You can make that very
statement about basically any member of struct bio, until we end up with
the submit_bio() path and everything below it taking 16 arguments.

> > If you check up on the iommu virtual merging, you'll understand the
> > front and back size members. They may smell dubious to you, but
> > please take the time to understand why it looks the way it does.
>
> Virtual merging is only needed at the physical device, so why do these
> fields live in struct bio instead of struct request?

A bio does exist outside of a struct request, and bio buildup also
happens before it gets attached to such.

> > Changing the number of bvecs is integral to how bio buildup current
> > works.
>
> Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can
> derive efficiently from BIO_POOL_IDX() provided the bio was allocated
> in the standard way.

That would only be feasible, if we ruled that any bio in the system must
originate from the standard pools.

> This leaves a little bit of clean up to do for bios not allocated from
> a standard pool.

Please suggest how to do such a cleanup.

> Incidentally, why does the bvl need to be memset to zero on allocation?
> bi_vcnt already tells you which bvecs are valid and the only field in a
> bvec that can reasonably default to zero is the offset, which ought to
> be set every time a bvec is initialized anyway.

We could probably skip that, but that's an unrelated subject.

> > > bi_destructor could be combined. I don't see a lot of users of
> > > bi_idx,
> >
> > bi_idx is integral to partial io completions.
>
> Struct request has a remaining submission sector count so what does
> bi_idx do that is different?

Struct request has remaining IO count. You still need to know where to
start in the bio.

> > > that looks like a soft target. See what happened to struct page
> > > when a couple of folks got serious about attacking it, some really
> > > deep hacks were done to pare off a few bytes here and there. But
> > > struct bio as a space waster is not nearly in the same ballpark.
> >
> > So show some concrete patches and examples, hand waving and
> > assumptions is just a waste of everyones time.
>
> Average struct bio memory footprint ranks near the bottom of the list of
> things that suck most about Linux storage. At idle I see 8K in use
> (reserves); during updatedb it spikes occasionally to 50K; under a
> heavy load generated by ddsnap on a storage box it sometimes goes to
> 100K with bio throttling in place. Really not moving the needle.

Then, again, stop wasting time on this subject. Just because struct bio
isn't a huge bloat is absolutely no justification for adding extra
members to it. It's not just about system wide bloat.

> On the other hand, vm writeout deadlock ranks smack dab at the top of
> the list, so that is where the patching effort must go for the
> foreseeable future. Without bio throttling, the ddsnap load can go to
> 24 MB for struct bio alone. That definitely moves the needle. In
> short, we save 3,200 times more memory by putting decent throttling in
> place than by saving an int in struct bio.

Then fix the damn vm writeout. I always thought it was silly to depend
on the block layer for any sort of throttling. If it's not a system wide
problem, then throttle the io count in the make_request_fn handler of
that problematic driver.

> That said, I did a little analysis to get an idea of where the soft
> targets are in struct bio, and to get to know the bio layer a little
> better. Maybe these few hints will get somebody interested enough to
> look further.
>
> > > It would be interesting to see if bi_bdev could be made read only.
> > > Generally, each stage in the block device stack knows what the next
> > > stage is going to be, so why do we have to write that in the bio?
> > > For error reporting from interrupt context? Anyway, if Evgeniy
> > > wants to do the patch, I will happily unload the task of convincing
> > > you that random fields are/are not needed in struct bio :-)
> >
> > It's a trade off, otherwise you'd have to pass the block device
> > around a lot.
>
> Which costs very little, probably less than trashing an extra field's
> worth of cache.

Again, you can make that argument for most of the members. It's a
non-starter.

> > And it's, again, a design issue. A bio contains
> > destination information, that means device/offset/size information.
> > I'm all for shaving structure bytes where it matters, but not for the
> > sake of sacrificing code stability or design. I consider struct bio
> > quite lean and have worked hard to keep it that way. In fact, iirc,
> > the only addition to struct bio since 2001 is the iommu front/back
> > size members. And I resisted those for quite a while.
>
> You did not comment on the one about putting the bio destructor in
> the ->endio handler, which looks dead simple. The majority of cases
> just use the default endio handler and the default destructor. Of the
> remaining cases, where a specialized destructor is needed, typically a
> specialized endio handler is too, so combining is free. There are few
> if any cases where a new specialized endio handler would need to be
> written.

We could do that without too much work, I agree.

> As far as code stability goes, current kernels are horribly unstable in
> a variety of contexts because of memory deadlock and slowdowns related
> to the attempt to fix the problem via dirty memory limits. Accurate
> throttling of bio traffic is one of the two key requirements to fix
> this instability, the other is accurate writeout path reserve
> management, which is only partially addressed by BIO_POOL.

Which, as written above and stated many times over the years on lkml, is
not a block layer issue imho.

> Nice to see you jumping in Jens. Now it is over to the other side of
> the thread where Evgeniy has posted a patch that a) grants your wish to
> add no new field in struct bio and b) does not fix the problem.

This is why it's impossible for me to have any sort of constructive
conversation with you.

--
Jens Axboe

2007-08-13 12:54:45

by Jens Axboe

[permalink] [raw]
Subject: Re: Distributed storage.

On Mon, Aug 13 2007, Jens Axboe wrote:
> > You did not comment on the one about putting the bio destructor in
> > the ->endio handler, which looks dead simple. The majority of cases
> > just use the default endio handler and the default destructor. Of the
> > remaining cases, where a specialized destructor is needed, typically a
> > specialized endio handler is too, so combining is free. There are few
> > if any cases where a new specialized endio handler would need to be
> > written.
>
> We could do that without too much work, I agree.

But that idea fails as well, since reference counts and IO completion
are two completely separate entities. So unless end IO just happens to
be the last user holding a reference to the bio, you cannot free it.

--
Jens Axboe

2007-08-13 13:04:49

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Sun, Aug 12, 2007 at 11:44:00PM -0700, Daniel Phillips ([email protected]) wrote:
> On Sunday 12 August 2007 22:36, I wrote:
> > Note! There are two more issues I forgot to mention earlier.
>
> Oops, and there is also:
>
> 3) The bio throttle, which is supposed to prevent deadlock, can itself
> deadlock. Let me see if I can remember how it goes.
>
> * generic_make_request puts a bio in flight
> * the bio gets past the throttle and initiates network IO
> * net calls sk_alloc->alloc_pages->shrink_caches
> * shrink_caches submits a bio recursively to our block device
> * this bio blocks on the throttle
> * net may never get the memory it needs, and we are wedged

If the system is in such a condition, it is already broken - the throttle
limit must be lowered (next time) so as not to allow such a situation.

--
Evgeniy Polyakov

2007-08-13 13:06:38

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

Hi Daniel.

On Sun, Aug 12, 2007 at 04:16:10PM -0700, Daniel Phillips ([email protected]) wrote:
> Your patch is close to the truth, but it needs to throttle at the top
> (virtual) end of each block device stack instead of the bottom
> (physical) end. It does head in the direction of eliminating your own
> deadlock risk indeed, however there are block devices it does not
> cover.

I decided to limit the physical devices just because any limit on top of
the virtual one is not correct. When the system recharges a bio from the
virtual device to the physical one, and the latter is full, the virtual
device will not accept any new blocks for that physical device, but can
still accept them for other ones. That was created specially to allow fair
use of network and physical storages.

--
Evgeniy Polyakov

2007-08-13 13:08:33

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips ([email protected]) wrote:
> (previous incomplete message sent accidentally)
>
> On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> >
> > So, what did we decide? To bloat bio a bit (add a queue pointer) or
> > to use physical device limits? The latter requires to replace all
> > occurrences of bio->bi_bdev = something_new with blk_set_bdev(bio,
> > something_new), where queue limits will be appropriately charged. So
> > far I'm testing second case, but I only changed DST for testing, can
> > change all other users if needed though.
>
> Adding a queue pointer to struct bio and using physical device limits as
> in your posted patch both suffer from the same problem: you release the
> throttling on the previous queue when the bio moves to a new one, which
> is a bug because memory consumption on the previous queue then becomes
> unbounded, or limited only by the number of struct requests that can be
> allocated. In other words, it reverts to the same situation we have
> now as soon as the IO stack has more than one queue. (Just a shorter
> version of my previous post.)

No. Since all requests for a virtual device end up in physical devices,
which have limits, this mechanism works. The virtual device will
essentially either call generic_make_request() for the new physical
device (and thus will sleep if the limit is over), or will process bios
directly, but in that case it will sleep in generic_make_request() for
the virtual device.

> 1) One throttle count per submitted bio is too crude a measure. A bio
> can carry as few as one page or as many as 256 pages. If you take only

It does not matter - we can count bytes, pages, bio vectors or whatever
we like, it's just a matter of the counter and can be changed without
problem.

> 2) Exposing the per-block device throttle limits via sysfs or similar is
> really not a good long term solution for system administration.
> Imagine our help text: "just keep trying smaller numbers until your
> system deadlocks". We really need to figure this out internally and
> get it correct. I can see putting in a temporary userspace interface
> just for experimentation, to help determine what really is safe, and
> what size the numbers should be to approach optimal throughput in a
> fully loaded memory state.

Well, we already have a number of such 'supposed-to-be-automatic'
variables exported to userspace, so this will not change the picture.
Frankly, I do not care whether there will or will not be any sysfs-exported
tunable; eventually we can remove it or not create it at all.

--
Evgeniy Polyakov

2007-08-13 13:16:20

by Daniel Phillips

[permalink] [raw]
Subject: Re: Distributed storage.

On Monday 13 August 2007 00:28, Jens Axboe wrote:
> On Sun, Aug 12 2007, Daniel Phillips wrote:
> > Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can
> > derive efficiently from BIO_POOL_IDX() provided the bio was
> > allocated in the standard way.
>
> That would only be feasible, if we ruled that any bio in the system
> must originate from the standard pools.

Not at all.

> > This leaves a little bit of clean up to do for bios not allocated
> > from a standard pool.
>
> Please suggest how to do such a cleanup.

Easy, use the BIO_POOL bits to know bi_max_vecs, the same as for a
bio from the standard pool. Just put the power of two size in the bits
and map that number to the standard pool arrangement with a table
lookup.
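
Something along these lines, as a sketch against the standard bvec pool
sizes (the table and helper are illustrative; BIO_POOL_IDX() is the
existing macro from bio.h):

/* the standard bvec pools hold 1, 4, 16, 64, 128 and 256 vecs */
static const unsigned short pool_max_vecs[] = { 1, 4, 16, 64, 128, 256 };

static inline unsigned bio_max_vecs(struct bio *bio)
{
        return pool_max_vecs[BIO_POOL_IDX(bio)];
}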

> > On the other hand, vm writeout deadlock ranks smack dab at the top
> > of the list, so that is where the patching effort must go for the
> > foreseeable future. Without bio throttling, the ddsnap load can go
> > to 24 MB for struct bio alone. That definitely moves the needle.
> > In short, we save 3,200 times more memory by putting decent
> > throttling in place than by saving an int in struct bio.
>
> Then fix the damn vm writeout. I always thought it was silly to
> depend on the block layer for any sort of throttling. If it's not a
> system wide problem, then throttle the io count in the
> make_request_fn handler of that problematic driver.

It is a system wide problem. Every block device needs throttling,
otherwise queues expand without limit. Currently, block devices that
use the standard request library get a slipshod form of throttling for
free in the form of limiting in-flight request structs. Because the
amount of IO carried by a single request can vary by two orders of
magnitude, the system behavior of this approach is far from
predictable.

> > You did not comment on the one about putting the bio destructor in
> > the ->endio handler, which looks dead simple. The majority of
> > cases just use the default endio handler and the default
> > destructor. Of the remaining cases, where a specialized destructor
> > is needed, typically a specialized endio handler is too, so
> > combining is free. There are few if any cases where a new
> > specialized endio handler would need to be written.
>
> We could do that without too much work, I agree.

OK, we got one and another is close to cracking, enough of that.

> > As far as code stability goes, current kernels are horribly
> > unstable in a variety of contexts because of memory deadlock and
> > slowdowns related to the attempt to fix the problem via dirty
> > memory limits. Accurate throttling of bio traffic is one of the
> > two key requirements to fix this instability, the other is
> > accurate writeout path reserve management, which is only partially
> > addressed by BIO_POOL.
>
> Which, as written above and stated many times over the years on lkml,
> is not a block layer issue imho.

Whoever stated that was wrong, but this should be no surprise. There
have been many wrong things said about this particular bug over the
years. The one thing that remains constant is, Linux continues to
deadlock under a variety of loads both with and without network
involvement, making it effectively useless as a storage platform.

These deadlocks are, first and foremost, block layer deficiencies. Even
the network becomes part of the problem only because it lies in the
block IO path.

Regards,

Daniel

2007-08-13 13:19:34

by Daniel Phillips

[permalink] [raw]
Subject: Re: Distributed storage.

On Monday 13 August 2007 00:45, Jens Axboe wrote:
> On Mon, Aug 13 2007, Jens Axboe wrote:
> > > You did not comment on the one about putting the bio destructor
> > > in the ->endio handler, which looks dead simple. The majority of
> > > cases just use the default endio handler and the default
> > > destructor. Of the remaining cases, where a specialized
> > > destructor is needed, typically a specialized endio handler is
> > > too, so combining is free. There are few if any cases where a
> > > new specialized endio handler would need to be written.
> >
> > We could do that without too much work, I agree.
>
> But that idea fails as well, since reference counts and IO completion
> are two completely separate entities. So unless end IO just happens
> to be the last user holding a reference to the bio, you cannot free
> it.

That is not a problem. When bio_put hits zero it calls ->endio instead
of the destructor. The ->endio sees that the count is zero and
destroys the bio.
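
In sketch form it is just this (illustrative only: the real bi_end_io of
current kernels takes more arguments, and the two example helpers named
here are made up):

void bio_put(struct bio *bio)
{
        if (atomic_dec_and_test(&bio->bi_cnt))
                bio->bi_end_io(bio, 0);        /* count == 0: the "destroy" call */
}

static void example_end_io(struct bio *bio, int error)
{
        if (atomic_read(&bio->bi_cnt)) {
                example_note_completion(bio, error);  /* ordinary completion */
                return;
        }
        example_free_bio(bio);        /* final call from bio_put(): destructor */
}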

Regards,

Daniel

2007-08-13 13:21:16

by Jens Axboe

[permalink] [raw]
Subject: Re: Distributed storage.

On Mon, Aug 13 2007, Daniel Phillips wrote:
> On Monday 13 August 2007 00:28, Jens Axboe wrote:
> > On Sun, Aug 12 2007, Daniel Phillips wrote:
> > > Right, that is done by bi_vcnt. I meant bi_max_vecs, which you can
> > > derive efficiently from BIO_POOL_IDX() provided the bio was
> > > allocated in the standard way.
> >
> > That would only be feasible, if we ruled that any bio in the system
> > must originate from the standard pools.
>
> Not at all.
>
> > > This leaves a little bit of clean up to do for bios not allocated
> > > from a standard pool.
> >
> > Please suggest how to do such a cleanup.
>
> Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a
> bio from the standard pool. Just put the power of two size in the bits
> and map that number to the standard pool arrangement with a table
> lookup.

So reserve a bit that tells you how to interpret the (now) 3 remaining
bits. Doesn't sound very pretty, does it?

> > > On the other hand, vm writeout deadlock ranks smack dab at the top
> > > of the list, so that is where the patching effort must go for the
> > > foreseeable future. Without bio throttling, the ddsnap load can go
> > > to 24 MB for struct bio alone. That definitely moves the needle.
> > > In short, we save 3,200 times more memory by putting decent
> > > throttling in place than by saving an int in struct bio.
> >
> > Then fix the damn vm writeout. I always thought it was silly to
> > depend on the block layer for any sort of throttling. If it's not a
> > system wide problem, then throttle the io count in the
> > make_request_fn handler of that problematic driver.
>
> It is a system wide problem. Every block device needs throttling,
> otherwise queues expand without limit. Currently, block devices that
> use the standard request library get a slipshod form of throttling for
> free in the form of limiting in-flight request structs. Because the
> amount of IO carried by a single request can vary by two orders of
> magnitude, the system behavior of this approach is far from
> predictable.

Is it? Consider just 10 standard sata disks. The next kernel revision
will have sg chaining support, so that allows 32MiB per request. Even if
we disregard reads (not so interesting in this discussion) and just look
at potentially pinned dirty data in a single queue, that number comes to
4GiB PER disk. Or 40GiB for 10 disks. Ouch.
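
For reference, the arithmetic behind those figures appears to assume the
default queue depth of 128 in-flight requests per device:

\[ 128 \times 32\,\mathrm{MiB} = 4\,\mathrm{GiB} \text{ per disk}, \qquad
   10 \times 4\,\mathrm{GiB} = 40\,\mathrm{GiB}. \]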

So I still think that this throttling needs to happen elsewhere, you
cannot rely on the block layer throttling globally or for a single device.
It just doesn't make sense.

> > > You did not comment on the one about putting the bio destructor in
> > > the ->endio handler, which looks dead simple. The majority of
> > > cases just use the default endio handler and the default
> > > destructor. Of the remaining cases, where a specialized destructor
> > > is needed, typically a specialized endio handler is too, so
> > > combining is free. There are few if any cases where a new
> > > specialized endio handler would need to be written.
> >
> > We could do that without too much work, I agree.
>
> OK, we got one and another is close to cracking, enough of that.

No we did not, I already failed this one in the next mail.

> > > As far as code stability goes, current kernels are horribly
> > > unstable in a variety of contexts because of memory deadlock and
> > > slowdowns related to the attempt to fix the problem via dirty
> > > memory limits. Accurate throttling of bio traffic is one of the
> > > two key requirements to fix this instability, the other is
> > > accurate writeout path reserve management, which is only partially
> > > addressed by BIO_POOL.
> >
> > Which, as written above and stated many times over the years on lkml,
> > is not a block layer issue imho.
>
> Whoever stated that was wrong, but this should be no surprise. There
> have been many wrong things said about this particular bug over the
> years. The one thing that remains constant is, Linux continues to
> deadlock under a variety of loads both with and without network
> involvement, making it effectively useless as a storage platform.
>
> These deadlocks are first and foremost, block layer deficiencies. Even
> the network becomes part of the problem only because it lies in the
> block IO path.

The block layer has NEVER guaranteed throttling, so it can - by
definition - not be a block layer deficiency.

--
Jens Axboe

2007-08-13 13:22:05

by Jens Axboe

[permalink] [raw]
Subject: Re: Distributed storage.

On Mon, Aug 13 2007, Daniel Phillips wrote:
> On Monday 13 August 2007 00:45, Jens Axboe wrote:
> > On Mon, Aug 13 2007, Jens Axboe wrote:
> > > > You did not comment on the one about putting the bio destructor
> > > > in the ->endio handler, which looks dead simple. The majority of
> > > > cases just use the default endio handler and the default
> > > > destructor. Of the remaining cases, where a specialized
> > > > destructor is needed, typically a specialized endio handler is
> > > > too, so combining is free. There are few if any cases where a
> > > > new specialized endio handler would need to be written.
> > >
> > > We could do that without too much work, I agree.
> >
> > But that idea fails as well, since reference counts and IO completion
> > are two completely separate entities. So unless end IO just happens
> > to be the last user holding a reference to the bio, you cannot free
> > it.
>
> That is not a problem. When bio_put hits zero it calls ->endio instead
> of the destructor. The ->endio sees that the count is zero and
> destroys the bio.

You can't be serious? You'd stall end io completion notification because
someone holds a reference to a bio. Surely you jest.

Needless to say, that will never go in.

--
Jens Axboe

2007-08-13 13:23:50

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Distributed storage.

On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips ([email protected]) wrote:
> > But that idea fails as well, since reference counts and IO completion
> > are two completely separate entities. So unless end IO just happens
> > to be the last user holding a reference to the bio, you cannot free
> > it.
>
> That is not a problem. When bio_put hits zero it calls ->endio instead
> of the destructor. The ->endio sees that the count is zero and
> destroys the bio.

This is not a very good solution, since it requires all users of the
bios to know how to free it. Right now it is hidden.
And adds an additional atomic check (although reading is quite fast) in the
end_io. And for what purpose? To eat 8 bytes on 64bit platform? This
will not reduce its size noticeably, so the same number of bios will be
in the cache's page, so what is the gain? All these cleanups and logic
complications should be performed only if after the size shrinking an
increased number of bios can fit into the cache's page; will it be done
after such cleanups?

--
Evgeniy Polyakov

2007-08-13 13:38:58

by Daniel Phillips

[permalink] [raw]
Subject: Re: Distributed storage.

On Monday 13 August 2007 02:13, Jens Axboe wrote:
> On Mon, Aug 13 2007, Daniel Phillips wrote:
> > On Monday 13 August 2007 00:45, Jens Axboe wrote:
> > > On Mon, Aug 13 2007, Jens Axboe wrote:
> > > > > You did not comment on the one about putting the bio
> > > > > destructor in the ->endio handler, which looks dead simple.
> > > > > The majority of cases just use the default endio handler and
> > > > > the default destructor. Of the remaining cases, where a
> > > > > specialized destructor is needed, typically a specialized
> > > > > endio handler is too, so combining is free. There are few if
> > > > > any cases where a new specialized endio handler would need to
> > > > > be written.
> > > >
> > > > We could do that without too much work, I agree.
> > >
> > > But that idea fails as well, since reference counts and IO
> > > completion are two completely separate entities. So unless end IO
> > > just happens to be the last user holding a reference to the bio,
> > > you cannot free it.
> >
> > That is not a problem. When bio_put hits zero it calls ->endio
> > instead of the destructor. The ->endio sees that the count is zero
> > and destroys the bio.
>
> You can't be serious? You'd stall end io completion notification
> because someone holds a reference to a bio.

Of course not. Nothing I said stops endio from being called in the
usual way as well. For this to work, endio just needs to know that one
call means "end" and the other means "destroy", this is trivial.

Regards,

Daniel

2007-08-13 13:42:18

by Jens Axboe

[permalink] [raw]
Subject: Re: Distributed storage.

On Mon, Aug 13 2007, Daniel Phillips wrote:
> On Monday 13 August 2007 02:13, Jens Axboe wrote:
> > On Mon, Aug 13 2007, Daniel Phillips wrote:
> > > On Monday 13 August 2007 00:45, Jens Axboe wrote:
> > > > On Mon, Aug 13 2007, Jens Axboe wrote:
> > > > > > You did not comment on the one about putting the bio
> > > > > > destructor in the ->endio handler, which looks dead simple.
> > > > > > The majority of cases just use the default endio handler and
> > > > > > the default destructor. Of the remaining cases, where a
> > > > > > specialized destructor is needed, typically a specialized
> > > > > > endio handler is too, so combining is free. There are few if
> > > > > > any cases where a new specialized endio handler would need to
> > > > > > be written.
> > > > >
> > > > > We could do that without too much work, I agree.
> > > >
> > > > But that idea fails as well, since reference counts and IO
> > > > completion are two completely separate entities. So unless end IO
> > > > just happens to be the last user holding a reference to the bio,
> > > > you cannot free it.
> > >
> > > That is not a problem. When bio_put hits zero it calls ->endio
> > > instead of the destructor. The ->endio sees that the count is zero
> > > and destroys the bio.
> >
> > You can't be serious? You'd stall end io completion notification
> > because someone holds a reference to a bio.
>
> Of course not. Nothing I said stops endio from being called in the
> usual way as well. For this to work, endio just needs to know that one
> call means "end" and the other means "destroy", this is trivial.

Sorry Daniel, but your suggestions would do nothing more than uglify the
code and design.

--
Jens Axboe

2007-08-13 13:44:44

by Daniel Phillips

[permalink] [raw]
Subject: Re: Distributed storage.

On Monday 13 August 2007 02:18, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips
([email protected]) wrote:
> > > But that idea fails as well, since reference counts and IO
> > > completion are two completely separate entities. So unless end IO
> > > just happens to be the last user holding a reference to the bio,
> > > you cannot free it.
> >
> > That is not a problem. When bio_put hits zero it calls ->endio
> > instead of the destructor. The ->endio sees that the count is zero
> > and destroys the bio.
>
> This is not a very good solution, since it requires all users of the
> bios to know how to free it.

No, only the specific ->endio needs to know that, which is set by the
bio owner, so this knowledge lies in exactly the right place. A small
handful of generic endios all with the same destructor are used nearly
everywhere.

> Right now it is hidden.
> And adds additional atomic check (although reading is quite fast) in
> the end_io.

Actual endio happens once in the lifetime of the transfer, this read
will be entirely lost in the noise.

> And for what purpose? To eat 8 bytes on 64bit platform?
> This will not reduce its size noticeably, so the same number of bios
> will be in the cache's page, so what is the gain? All these cleanups and
> logic complications should be performed only if after the size shrinking
> an increased number of bios can fit into the cache's page; will it be done
> after such cleanups?

Well, exactly. My point from the beginning was that the size of struct
bio is not even close to being a problem and adding a few bytes to it
in the interest of doing the cleanest fix to a core kernel bug is just
not a dominant issue.

I suppose that leaving out the word "bloated" and skipping straight to
the "doesn't matter" proof would have saved some bandwidth.

Regards,

Daniel

2007-08-13 13:46:57

by Daniel Phillips

[permalink] [raw]
Subject: Re: Distributed storage.

On Monday 13 August 2007 03:06, Jens Axboe wrote:
> On Mon, Aug 13 2007, Daniel Phillips wrote:
> > Of course not. Nothing I said stops endio from being called in the
> > usual way as well. For this to work, endio just needs to know that
> > one call means "end" and the other means "destroy", this is
> > trivial.
>
> Sorry Daniel, but your suggestions would do nothing more than uglify
> the code and design.

Pretty much exactly what was said about shrinking struct page, ask Bill.
The difference was, shrinking struct page actually mattered whereas
shrinking struct bio does not, and neither does expanding it by a few
bytes.

Regards,

Daniel

2007-08-13 13:50:03

by Jens Axboe

[permalink] [raw]
Subject: Re: Distributed storage.

On Mon, Aug 13 2007, Daniel Phillips wrote:
> On Monday 13 August 2007 03:06, Jens Axboe wrote:
> > On Mon, Aug 13 2007, Daniel Phillips wrote:
> > > Of course not. Nothing I said stops endio from being called in the
> > > usual way as well. For this to work, endio just needs to know that
> > > one call means "end" and the other means "destroy", this is
> > > trivial.
> >
> > Sorry Daniel, but your suggestions would do nothing more than uglify
> > the code and design.
>
> Pretty much exactly what was said about shrinking struct page, ask Bill.
> The difference was, shrinking struct page actually mattered whereas
> shrinking struct bio does not, and neither does expanding it by a few
> bytes.

Let's back this up a bit - this whole thing began with you saying that
struct bio was bloated already, which I said wasn't true. You then
continued to hand wave your way through various suggestions to trim the
obvious fat from that structure, none of which were nice or feasible.

I never compared the bio to struct page, I'd obviously agree that
shrinking struct page was a worthy goal and that it'd be ok to uglify
some code to do that. The same isn't true for struct bio.

And we can expand struct bio if we have to, naturally. And I've done it
before, which I wrote in the initial mail. I just don't want to do it
casually, then it WILL be bloated all of a sudden. Your laissez faire
attitude towards adding members to struct bio "oh I'll just add it and
someone less lazy than me will fix it up in the future" makes me happy
that you are not maintaining anything that I use.

I'll stop replying to your mails until something interesting surfaces.
I've already made my points clear about both the above and the
throttling. And I'd advise you to let Evgeniy take this forward, he
seems a lot more adept to actually getting CODE done and - at least from
my current and past perspective - is someone you can actually have a
fruitful conversation with.

--
Jens Axboe

2007-08-13 13:52:40

by Daniel Phillips

[permalink] [raw]
Subject: Re: Distributed storage.

On Monday 13 August 2007 03:22, Jens Axboe wrote:
> I never compared the bio to struct page, I'd obviously agree that
> shrinking struct page was a worthy goal and that it'd be ok to uglify
> some code to do that. The same isn't true for struct bio.

I thought I just said that.

Regards,

Daniel

2007-08-13 14:06:09

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Distributed storage.

On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips ([email protected]) wrote:
> > This is not a very good solution, since it requires all users of the
> > bios to know how to free it.
>
> No, only the specific ->endio needs to know that, which is set by the
> bio owner, so this knowledge lies in exactly the right place. A small
> handful of generic endios all with the same destructor are used nearly
> everywhere.

That is what I meant - there will be no way to just alloc a bio and put
it; helpers for the generic bio sets must be exported, and each and every
bi_end_io() must be changed to check the reference counter, and they must
know how they were allocated.

> > Right now it is hidden.
> > And adds additional atomic check (although reading is quite fast) in
> > the end_io.
>
> Actual endio happens once in the lifetime of the transfer, this read
> will be entirely lost in the noise.

Not always. Sometimes it is called multiple times, but all bi_end_io()
callbacks I checked (I believe all in the mainline tree) test whether bi_size is
zero or not.

Endio callback is of course quite rare and additional atomic
reading will not kill the system, but why introduce another read?
It is possible to provide a flag for the endio callback that it is the last,
but it still requires changing every single callback - why do we want this?

> > And for what purpose? To eat 8 bytes on 64bit platform?
> > This will not reduce its size noticeably, so the same number of bios
> > will be in the cache's page, so what is the gain? All these cleanups and
> > logic complications should be performed only if after the size shrinking
> > an increased number of bios can fit into the cache's page; will it be done
> > after such cleanups?
>
> Well, exactly, My point from the beginning was that the size of struct
> bio is not even close to being a problem and adding a few bytes to it
> in the interest of doing the cleanest fix to a core kernel bug is just
> not a dominant issue.

So, I'm a bit lost...

You say it is too big and some parts can be removed or combined, and
then that the size does not matter. Last/not-last checks in the code are
not a clear design, so I do not see why they are needed at all if not for
size shrinking.

> I suppose that leaving out the word "bloated" and skipping straight to
> the "doesn't matter" proof would have saved some bandwidth.

:) Likely it will.

--
Evgeniy Polyakov

2007-08-13 14:06:35

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > Oops, and there is also:
> >
> > 3) The bio throttle, which is supposed to prevent deadlock, can
> > itself deadlock. Let me see if I can remember how it goes.
> >
> > * generic_make_request puts a bio in flight
> > * the bio gets past the throttle and initiates network IO
> > * net calls sk_alloc->alloc_pages->shrink_caches
> > * shrink_caches submits a bio recursively to our block device
> > * this bio blocks on the throttle
> > * net may never get the memory it needs, and we are wedged
>
> If system is in such condition, it is already broken - throttle limit
> must be lowered (next time) not to allow such situation.

Agreed that the system is broken, however lowering the throttle limit
gives no improvement in this case.

This is not theoretical, but a testable, repeatable result.
Instructions to reproduce should show up tomorrow.

This bug is now solved in a kludgy way. Now, Peter's patch set offers a
much cleaner way to fix this little problem, along with at least one
other nasty that it already fixed.

Regards,

Daniel

2007-08-13 14:18:49

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote:
> On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips
([email protected]) wrote:
> > (previous incomplete message sent accidentally)
> >
> > On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> > >
> > > So, what did we decide? To bloat bio a bit (add a queue pointer)
> > > or to use physical device limits? The latter requires to replace
> > > all occurrences of bio->bi_bdev = something_new with
> > > blk_set_bdev(bio, something_new), where queue limits will be
> > > appropriately charged. So far I'm testing second case, but I only
> > > changed DST for testing, can change all other users if needed
> > > though.
> >
> > Adding a queue pointer to struct bio and using physical device
> > limits as in your posted patch both suffer from the same problem:
> > you release the throttling on the previous queue when the bio moves
> > to a new one, which is a bug because memory consumption on the
> > previous queue then becomes unbounded, or limited only by the
> > number of struct requests that can be allocated. In other words,
> > it reverts to the same situation we have now as soon as the IO
> > stack has more than one queue. (Just a shorter version of my
> > previous post.)
>
> No. Since all requests for virtual device end up in physical devices,
> which have limits, this mechanism works. Virtual device will
> essentially call either generic_make_request() for new physical
> > device (and thus will sleep if the limit is over), or will process bios
> > directly, but in that case it will sleep in generic_make_request()
> > for virtual device.

What can happen is, as soon as you unthrottle the previous queue,
another thread can come in and put another request on it. Sure, that
thread will likely block on the physical throttle and so will the rest
of the incoming threads, but it still allows the higher level queue to
grow past any given limit, with the help of lots of threads. JVM for
example?

Say you have a device mapper device with some physical device sitting
underneath, the classic use case for this throttle code. Say 8,000
threads each submit an IO in parallel. The device mapper mapping
function will be called 8,000 times with associated resource
allocations, regardless of any throttling on the physical device queue.

Anyway, your approach is awfully close to being airtight, there is just
a small hole. I would be more than happy to be proved wrong about
that, but the more I look, the more I see that hole.

> > 1) One throttle count per submitted bio is too crude a measure. A
> > bio can carry as few as one page or as many as 256 pages. If you
> > take only
>
> It does not matter - we can count bytes, pages, bio vectors or
> whatever we like, its just a matter of counter and can be changed
> without problem.

Quite true. In some cases the simple inc/dec per bio works just fine.
But the general case where finer granularity is required comes up in
existing code, so there needs to be a plan.

Regards,

Daniel

2007-08-13 14:43:30

by Daniel Phillips

[permalink] [raw]
Subject: Re: Distributed storage.

On Monday 13 August 2007 04:03, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips
([email protected]) wrote:
> > > This is not a very good solution, since it requires all users of
> > > the bios to know how to free it.
> >
> > No, only the specific ->endio needs to know that, which is set by
> > the bio owner, so this knowledge lies in exactly the right place.
> > A small handful of generic endios all with the same destructor are
> > used nearly everywhere.
>
> That is what I meant - there will be no way to just alloc a bio and
> put it, helpers for generic bio sets must be exported and each and
> every bi_end_io() must be changed to check reference counter and they
> must know how they were allocated.

There are fewer non-generic bio allocators than you think.

> Endio callback is of course quite rare and additional atomic
> reading will not kill the system, but why introduce another read?
> It is possible to provide a flag for endio callback that it is last,
> but it still requires to change every single callback - why do we
> want this?

We don't. Struct bio does not need to be shrunk. Jens wanted to talk
about what fields could be eliminated if we wanted to shrink it. It is
about time to let that lie, don't you think?

> So, I'm a bit lost...
>
> You say it is too big

Did not say that.

> and some parts can be removed or combined

True.

> and then that size does not matter.

Also true, backed up by numbers on real systems.

> Last/not-last checks in the code is
> not clear design, so I do not see why it is needed at all if not for
> size shrinking.

Not needed, indeed. Accurate throttling is needed. If the best way to
throttle requires expanding struct bio a little then we should not let
concerns about the cost of an int or two stand in the way. Like Jens,
I am more concerned about the complexity cost, and that is minimized in
my opinion by throttling in the generic code rather than with custom
code in each specialized block driver.

Your patch does throttle in the generic code, great. Next thing is to
be sure that it completely closes the window for reserve leakage, which
is not yet clear.

Regards,

Daniel

2007-08-13 14:47:21

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([email protected]) wrote:
> On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > Oops, and there is also:
> > >
> > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > itself deadlock. Let me see if I can remember how it goes.
> > >
> > > * generic_make_request puts a bio in flight
> > > * the bio gets past the throttle and initiates network IO
> > > * net calls sk_alloc->alloc_pages->shrink_caches
> > > * shrink_caches submits a bio recursively to our block device
> > > * this bio blocks on the throttle
> > > * net may never get the memory it needs, and we are wedged
> >
> > If system is in such condition, it is already broken - throttle limit
> > must be lowered (next time) not to allow such situation.
>
> Agreed that the system is broken, however lowering the throttle limit
> gives no improvement in this case.

How is it ever possible? The whole idea of throttling is to remove such
a situation, and now you say it cannot be solved. If the limit is 1gb of
pending block io, and the system has for example 2gbs of ram (or any
other reasonable parameters), then there is no way we can deadlock in
allocation, since it will not force the page reclaim mechanism.

--
Evgeniy Polyakov

2007-08-13 14:55:56

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips
([email protected]) wrote:
> > On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > > Oops, and there is also:
> > > >
> > > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > > itself deadlock. Let me see if I can remember how it goes.
> > > >
> > > > * generic_make_request puts a bio in flight
> > > > * the bio gets past the throttle and initiates network IO
> > > > * net calls sk_alloc->alloc_pages->shrink_caches
> > > > * shrink_caches submits a bio recursively to our block device
> > > > * this bio blocks on the throttle
> > > > * net may never get the memory it needs, and we are wedged
> > >
> > > If system is in such condition, it is already broken - throttle
> > > limit must be lowered (next time) not to allow such situation.
> >
> > Agreed that the system is broken, however lowering the throttle
> > limit gives no improvement in this case.
>
> How is it ever possible? The whole idea of throttling is to remove
> such situation, and now you say it can not be solved.

It was solved, by not throttling writeout that comes from shrink_caches.
Ugly.

> If limit is for
> 1gb of pending block io, and system has for example 2gbs of ram (or
> any other reasonable parameters), then there is no way we can deadlock
> in allocation, since it will not force page reclaim mechanism.

The problem is that sk_alloc (called from our block driver via
socket->write) would recurse into shrink_pages, which recursively
submits IO to our block driver and blocks on the throttle. Subtle
indeed, and yet another demonstration of why vm recursion is a Bad
Thing.

I will find a traceback for you tomorrow, which makes this deadlock much
clearer.

Regards

2007-08-13 14:56:48

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Mon, Aug 13, 2007 at 04:18:03AM -0700, Daniel Phillips ([email protected]) wrote:
> > No. Since all requests for virtual device end up in physical devices,
> > which have limits, this mechanism works. Virtual device will
> > essentially call either generic_make_request() for new physical
> > device (and thus will sleep if the limit is over), or will process bios
> > directly, but in that case it will sleep in generic_make_request()
> > for virtual device.
>
> What can happen is, as soon as you unthrottle the previous queue,
> another thread can come in and put another request on it. Sure, that
> thread will likely block on the physical throttle and so will the rest
> of the incoming threads, but it still allows the higher level queue to
> grow past any given limit, with the help of lots of threads. JVM for
> example?

No. You get one slot, and one thread will not be blocked, all others
will. If the lucky thread wants to put two requests, it will be blocked on
the second request, since the underlying physical device does not accept
requests anymore and thus the caller will sleep.

> Say you have a device mapper device with some physical device sitting
> underneath, the classic use case for this throttle code. Say 8,000
> threads each submit an IO in parallel. The device mapper mapping
> function will be called 8,000 times with associated resource
> allocations, regardless of any throttling on the physical device queue.

Each thread will sleep in generic_make_request(); if the limit is
specified correctly, then the allocated number of bios will be enough to
make progress.

Here is an example:

let's say the system has 20,000 pages in RAM and 20,000 in swap, and we
have 8,000 threads; each one allocates a page, then the next page and so
on. The system has one virtual device with two physical devices under it,
and each device gets half of the requests.

We set the limit to 4,000 per physical device.

All threads allocate a page and queue it to the devices, so every thread
succeeds in its first allocation, and each device has its queue full.
The virtual device does not have a limit (or has it set to 4,000 too, but
since it is recharged each time, it has zero blocks in flight).

A new thread tries to allocate a page; it is allocated and queued to one
of the devices, but since that queue is full, the thread sleeps. So will
every other thread.

Thus we end up with 8,000 requests queued and 8,000 in flight, 16,000 in
total, which is smaller than the number of pages in RAM, so we are happy.

Consider the above a special kind of calculation, i.e. the number of
_allocated_ pages is always the number of physical devices multiplied by
each one's in-flight limit. By adjusting the in-flight limit and knowing
the number of devices, it is entirely possible to eliminate the vm
deadlock.

If you do not like such a calculation, the solution is trivial:
we can sleep _after_ ->make_request_fn() in
generic_make_request() until the number of in-flight bios is reduced by
bio_endio().
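
A minimal sketch of that alternative (the bios_in_flight, bio_limit and
throttle_wait fields are invented here purely for illustration, they are
not in the posted patch):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/wait.h>

/*
 * Sketch only: bios_in_flight, bio_limit and throttle_wait do not exist
 * in struct request_queue, they are assumed for this example.
 */

/* Charge the queue, hand the bio down, then sleep while over the limit. */
static void throttled_submit(struct request_queue *q, struct bio *bio)
{
        atomic_inc(&q->bios_in_flight);
        q->make_request_fn(q, bio);
        wait_event(q->throttle_wait,
                   atomic_read(&q->bios_in_flight) < q->bio_limit);
}

/* Called from the bio_endio() path: recharge the limit and wake sleepers. */
static void throttled_complete(struct request_queue *q)
{
        if (atomic_dec_return(&q->bios_in_flight) < q->bio_limit)
                wake_up(&q->throttle_wait);
}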

--
Evgeniy Polyakov

2007-08-13 14:58:45

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Mon, Aug 13, 2007 at 05:18:14AM -0700, Daniel Phillips ([email protected]) wrote:
> > If limit is for
> > 1gb of pending block io, and system has for example 2gbs of ram (or
> > any other reasonable parameters), then there is no way we can deadlock
> > in allocation, since it will not force page reclaim mechanism.
>
> The problem is that sk_alloc (called from our block driver via
> socket->write) would recurse into shrink_pages, which recursively
> submits IO to our block driver and blocks on the throttle. Subtle
> indeed, and yet another demonstration of why vm recursion is a Bad
> Thing.
>
> I will find a traceback for you tomorrow, which makes this deadlock much
> clearer.

I see how it can happen, but device throttling is the solution we are
trying to complete, whose main aim _is_ to remove this problem.

Lower the per-device limit, so that the rest of the RAM is enough to
allocate all the data structures needed in the network path.
The example above leaves just 1gb of RAM, which should be enough for skbs;
if it is not, decrease the limit to 500mb and so on, until the weighted
load of the system always allows forward progress.

--
Evgeniy Polyakov

2007-08-13 15:18:28

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote:
> > Say you have a device mapper device with some physical device
> > sitting underneath, the classic use case for this throttle code.
> > Say 8,000 threads each submit an IO in parallel. The device mapper
> > mapping function will be called 8,000 times with associated
> > resource allocations, regardless of any throttling on the physical
> > device queue.
>
> Each thread will sleep in generic_make_request(), if limit is
> specified correctly, then allocated number of bios will be enough to
> have a progress.

The problem is, the sleep does not occur before the virtual device
mapping function is called. Let's consider two devices, a physical
device named pdev and a virtual device sitting on top of it called
vdev. vdev's throttle limit is just one element, but we will see that
in spite of this, two bios can be handled by the vdev's mapping method
before any IO completes, which violates the throttling rules. According
to your patch it works like this:

Thread 1                            Thread 2

<no wait because vdev->bio_queued is zero>

vdev->q->bio_queued++

<enter devmapper map method>

blk_set_bdev(bio, pdev)
vdev->bio_queued--

                                    <no wait because vdev->bio_queued is zero>

                                    vdev->q->bio_queued++

                                    <enter devmapper map method>

whoops! Our virtual device mapping
function has now allocated resources
for two in-flight bios in spite of having its
throttle limit set to 1.

Perhaps you never worried about the resources that the device mapper
mapping function allocates to handle each bio and so did not consider
this hole significant. These resources can be significant, as is the
case with ddsnap. It is essential to close that window through which
the virtual device's queue limit may be violated. Not doing so will
allow deadlock.
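
A sketch of that rule (throttle_lock, available and throttle_wait are
hypothetical field names, not the ddsnap code): whatever is charged when
the bio enters the block layer is given back only from the completion
path, never at the point the bio is remapped.

#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Hypothetical fields on the virtual device's queue, for illustration. */

/* Take 'units' from the virtual device before its map method runs... */
static void vdev_throttle_charge(struct request_queue *vq, unsigned units)
{
        spin_lock_irq(&vq->throttle_lock);
        while (vq->available < units) {
                spin_unlock_irq(&vq->throttle_lock);
                wait_event(vq->throttle_wait, vq->available >= units);
                spin_lock_irq(&vq->throttle_lock);
        }
        vq->available -= units;
        spin_unlock_irq(&vq->throttle_lock);
}

/* ...and give them back only from ->endio, after the remapped bio and
 * every resource the map method allocated for it are finished. */
static void vdev_throttle_release(struct request_queue *vq, unsigned units)
{
        unsigned long flags;

        spin_lock_irqsave(&vq->throttle_lock, flags);
        vq->available += units;
        spin_unlock_irqrestore(&vq->throttle_lock, flags);
        wake_up(&vq->throttle_wait);
}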

Regards,

Daniel

2007-08-13 23:28:42

by Daniel Phillips

[permalink] [raw]
Subject: Re: Distributed storage.

On Monday 13 August 2007 02:12, Jens Axboe wrote:
> > It is a system wide problem. Every block device needs throttling,
> > otherwise queues expand without limit. Currently, block devices
> > that use the standard request library get a slipshod form of
> > throttling for free in the form of limiting in-flight request
> > structs. Because the amount of IO carried by a single request can
> > vary by two orders of magnitude, the system behavior of this
> > approach is far from predictable.
>
> Is it? Consider just 10 standard sata disks. The next kernel revision
> will have sg chaining support, so that allows 32MiB per request. Even
> if we disregard reads (not so interesting in this discussion) and
> just look at potentially pinned dirty data in a single queue, that
> number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch.
>
> So I still think that this throttling needs to happen elsewhere, you
> cannot rely on the block layer throttling globally or for a single
> device. It just doesn't make sense.

You are right, so long as the unit of throttle accounting remains one
request. This is not what we do in ddsnap. Instead we inc/dec the
throttle counter by the number of bvecs in each bio, which produces a
nice steady data flow to the disk under a wide variety of loads, and
provides the memory resource bound we require.

One throttle count per bvec will not be the right throttling metric for
every driver. To customize this accounting metric for a given driver
we already have the backing_dev_info structure, which provides
per-device-instance accounting functions and instance data. Perfect!
This allows us to factor the throttling mechanism out of the driver, so
the only thing the driver has to do is define the throttle accounting
if it needs a custom one.

We can avoid affecting the traditional behavior quite easily: for
example, if backing_dev_info->throttle_fn (a new method) is null, then we
either do not throttle at all (and rely on the struct request in-flight
limit), or we move the in-flight request throttling logic into the core
as the default throttling method, simplifying the request library and
not changing its behavior.

> > These deadlocks are first and foremost, block layer deficiencies.
> > Even the network becomes part of the problem only because it lies
> > in the block IO path.
>
> The block layer has NEVER guaranteed throttling, so it can - by
> definition - not be a block layer deficiency.

The block layer has always been deficient by not providing accurate
throttling, or any throttling at all for some devices. We have
practical proof that this causes deadlock and a good theoretical basis
for describing exactly how it happens.

To be sure, vm and net are co-conspirators, however the block layer
really is the main actor in this little drama.

Regards,

Daniel

2007-08-14 08:47:36

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([email protected]) wrote:
> Perhaps you never worried about the resources that the device mapper
> mapping function allocates to handle each bio and so did not consider
> this hole significant. These resources can be significant, as is the
> case with ddsnap. It is essential to close that window through which
> the virtual device's queue limit may be violated. Not doing so will
> allow deadlock.

This is not a bug, this is a special kind of calculation - the total limit
is the number of physical devices multiplied by their limits. It was done
_on purpose_ to allow different devices to have different limits (for
example, in the distributed storage project it is possible to have both a
remote and a local node in the same device, but the local device should
not have _any_ limit at all, while the network one should).

The virtual device essentially has _no_ limit. And that was done on purpose.

--
Evgeniy Polyakov

2007-08-14 11:13:40

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips
([email protected]) wrote:
> > Perhaps you never worried about the resources that the device
> > mapper mapping function allocates to handle each bio and so did not
> > consider this hole significant. These resources can be
> > significant, as is the case with ddsnap. It is essential to close
> > that window through which the virtual device's queue limit may be
> > violated. Not doing so will allow deadlock.
>
> This is not a bug, this is special kind of calculation - total limit
> is the number of physical devices multiplied by their limits. It was
> done _on purpose_ to allow different device to have different limits
> (for example in distributed storage project it is possible to have
> both remote and local node in the same device, but local device
> should not have _any_ limit at all, but network one should).
>
> Virtual device essentially has _no_ limit. And that was done on
> purpose.

And it will not solve the deadlock problem in general. (Maybe it works
for your virtual device, but I wonder...) If the virtual device
allocates memory during generic_make_request then the memory needs to
be throttled.

Regards,

Daniel

2007-08-14 11:31:34

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Tue, Aug 14, 2007 at 04:13:10AM -0700, Daniel Phillips ([email protected]) wrote:
> On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> > On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips
> ([email protected]) wrote:
> > > Perhaps you never worried about the resources that the device
> > > mapper mapping function allocates to handle each bio and so did not
> > > consider this hole significant. These resources can be
> > > significant, as is the case with ddsnap. It is essential to close
> > > that window through which the virtual device's queue limit may be
> > > violated. Not doing so will allow deadlock.
> >
> > This is not a bug, this is special kind of calculation - total limit
> > is the number of physical devices multiplied by their limits. It was
> > done _on purpose_ to allow different device to have different limits
> > (for example in distributed storage project it is possible to have
> > both remote and local node in the same device, but local device
> > should not have _any_ limit at all, but network one should).
> >
> > Virtual device essentially has _no_ limit. And that was done on
> > purpose.
>
> And it will not solve the deadlock problem in general. (Maybe it works
> for your virtual device, but I wonder...) If the virtual device
> allocates memory during generic_make_request then the memory needs to
> be throttled.

Daniel, if a device processes a bio by itself, it has a limit and thus it
will wait in generic_make_request(); if it queues the bio to a different
device, then the same logic applies there. If the virtual device does not
process the bio, its limit will always be recharged to the underlying
devices, and the overall limit is equal to the number of physical devices
(or devices which do process bios) multiplied by their limits. This does
_work_, and I showed an example of how the limits are processed and who
will sleep where. This solution is not a narrow fix; please check the
examples I showed before.

> Regards,
>
> Daniel

--
Evgeniy Polyakov

2007-08-14 11:36:11

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > And it will not solve the deadlock problem in general. (Maybe it
> > works for your virtual device, but I wonder...) If the virtual
> > device allocates memory during generic_make_request then the memory
> > needs to be throttled.
>
> Daniel, if device process bio by itself, it has a limit and thus it
> will wait in generic_make_request()

What will make it wait?

2007-08-14 11:51:48

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([email protected]) wrote:
> On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > And it will not solve the deadlock problem in general. (Maybe it
> > > works for your virtual device, but I wonder...) If the virtual
> > > device allocates memory during generic_make_request then the memory
> > > needs to be throttled.
> >
> > Daniel, if device process bio by itself, it has a limit and thus it
> > will wait in generic_make_request()
>
> What will make it wait?

generic_make_request() for the given block device.

Example:

virt_device -> do_smth_with_bio -> bio_endio().
     |
    / \
phys0   phys1

Each of the three devices above works with bios, each one eventually calls
bio_endio(), and bio->bi_bdev will be one of the three devices above.

Thus, when the system calls generic_make_request(bio->bi_bdev == virt_device),
one of the three limits will be charged, depending on whether the virtual
device forwards the bio to the physical devices or not. Actually, the
virtual device's limit will be charged first too, but if the bio is
forwarded, its portion will be removed from the virtual device's count
again.

Now, if the virtual device allocates a bio itself (like device mapper
does), then this new bio will be forwarded to the physical devices via
generic_make_request() and thus it will sleep in the physical device's
queue, if it is filled.

So, if each of the three devices has a limit of 10 bios, then the actual
number of bios in flight is at most 3 * 10, since each device will be
charged up to _its_ maximum limit, not the limit of the first device in
the chain.

So, if you set 10 for the virtual device and it can process bios itself
(like sending them to the network), then that is the number of bios in
flight which are processed by _this_ device and not forwarded further.
The actual number of bios you can flush into the virtual device is its
own limit plus the limits of all physical devices attached to it.
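
A small, self-contained illustration of that bound (not kernel code; the
helper and its inputs are invented for this example):

/* With per-device in-flight limits, the worst case number of bios in
 * flight through a stack is the sum of the limits of every device that
 * charges its own limit.  Compare that sum against available memory when
 * choosing the limits. */
static unsigned total_inflight_bound(const unsigned *limits, int ndev)
{
        unsigned total = 0;
        int i;

        for (i = 0; i < ndev; i++)
                total += limits[i];
        return total;   /* three devices with a limit of 10 each -> 30 */
}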

--
Evgeniy Polyakov

2007-08-14 12:32:54

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips
([email protected]) wrote:
> > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > And it will not solve the deadlock problem in general. (Maybe
> > > > it works for your virtual device, but I wonder...) If the
> > > > virtual device allocates memory during generic_make_request
> > > > then the memory needs to be throttled.
> > >
> > > Daniel, if device process bio by itself, it has a limit and thus
> > > it will wait in generic_make_request()
> >
> > What will make it wait?
>
> generic_make_request() for the given block device.

Not good enough, that only makes one thread wait. Look here:

http://lkml.org/lkml/2007/8/13/788

An unlimited number of threads can come in, each consuming resources of
the virtual device, and violating the throttling rules.

The throttling of the virtual device must begin in generic_make_request
and last to ->endio. You release the throttle of the virtual device at
the point you remap the bio to an underlying device, which you have
convinced yourself is ok, but it is not. You seem to miss the fact
that whatever resources the virtual device has allocated are no longer
protected by the throttle count *of the virtual device*, or you do not
see why that is a bad thing. It is a very bad thing, roughly like
leaving some shared data outside a spin_lock/unlock.

Regards,

Daniel

2007-08-14 12:48:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Tue, Aug 14, 2007 at 05:32:29AM -0700, Daniel Phillips ([email protected]) wrote:
> On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> > On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips
> ([email protected]) wrote:
> > > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > > And it will not solve the deadlock problem in general. (Maybe
> > > > > it works for your virtual device, but I wonder...) If the
> > > > > virtual device allocates memory during generic_make_request
> > > > > then the memory needs to be throttled.
> > > >
> > > > Daniel, if device process bio by itself, it has a limit and thus
> > > > it will wait in generic_make_request()
> > >
> > > What will make it wait?
> >
> > generic_make_request() for the given block device.
>
> Not good enough, that only makes one thread wait. Look here:
>
> http://lkml.org/lkml/2007/8/13/788
>
> An unlimited number of threads can come in, each consuming resources of
> the virtual device, and violating the throttling rules.
>
> The throttling of the virtual device must begin in generic_make_request
> and last to ->endio. You release the throttle of the virtual device at
> the point you remap the bio to an underlying device, which you have
> convinced yourself is ok, but it is not. You seem to miss the fact
> that whatever resources the virtual device has allocated are no longer
> protected by the throttle count *of the virtual device*, or you do not

Because it is charged to another device. No matter how many of them are
chained, the limit is applied to the last device being used.
So, if you have an unlimited number of threads, each one allocates a
request and forwards it down to the low-level devices; each one will
eventually sleep, but yes, each one _can_ allocate _one_ request before it
goes to sleep. It is done to allow fine-grained limits, since some devices
(like locally attached disks) do not require throttling.

Here is an example with threads you mentioned:
http://article.gmane.org/gmane.linux.file-systems/17644

--
Evgeniy Polyakov

2007-08-14 12:54:41

by Daniel Phillips

[permalink] [raw]
Subject: Re: Block device throttling [Re: Distributed storage.]

On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote:
> > The throttling of the virtual device must begin in
> > generic_make_request and last to ->endio. You release the throttle
> > of the virtual device at the point you remap the bio to an
> > underlying device, which you have convinced yourself is ok, but it
> > is not. You seem to miss the fact that whatever resources the
> > virtual device has allocated are no longer protected by the
> > throttle count *of the virtual device*, or you do not
>
> Because it is charged to another device.

Great. You charged the resource to another device, but you did not
limit the amount of resources that the first device can consume. Which
misses the whole point.

Regards,

Daniel

2007-08-27 21:59:39

by Daniel Phillips

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

Say Evgeniy, something I was curious about but forgot to ask you
earlier...

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> ...All operations are not atomic, since we do not care about precise
> number of bios, but a fact, that we are close or close enough to the
> limit.
> ... in bio->endio
> + q->bio_queued--;

In your proposed patch, what prevents the race:

cpu1                                cpu2

read q->bio_queued
                                    q->bio_queued--
write q->bio_queued - 1
Whoops! We leaked a throttle count.
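
In plain C, with made-up counters rather than the patch's fields, the
difference looks like this:

#include <asm/atomic.h>

static int plain_count;                         /* illustrative only */
static atomic_t safe_count = ATOMIC_INIT(0);    /* illustrative only */

static void unsafe_dec(void)
{
        /* Compiles to load, subtract, store.  If two CPUs interleave
         * here, both load the same value and one decrement is lost -
         * the leaked throttle count above. */
        plain_count--;
}

static void safe_dec(void)
{
        /* A locked read-modify-write; concurrent decrements cannot be
         * lost. */
        atomic_dec(&safe_count);
}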

Regards,

Daniel

2007-08-28 09:37:26

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([email protected]) wrote:
> Say Evgeniy, something I was curious about but forgot to ask you
> earlier...
>
> On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > ...All operations are not atomic, since we do not care about precise
> > number of bios, but a fact, that we are close or close enough to the
> > limit.
> > ... in bio->endio
> > + q->bio_queued--;
>
> In your proposed patch, what prevents the race:
>
> cpu1                                cpu2
>
> read q->bio_queued
>                                     q->bio_queued--
> write q->bio_queued - 1
> Whoops! We leaked a throttle count.

We do not care about one cpu being able to increase its counter higher
than the limit; such inaccuracy (the maximum number of bios in flight can
thus be more than the limit, the difference being equal to the number of
CPUs - 1) is the price for removing the atomic operation. I thought I
pointed this out in the original description, but might have forgotten:
if it becomes an issue, atomic operations can be introduced there. Any
uber-precise measurements in the case when we are close to the edge will
not give us any benefit at all, since we are already in the grey area.

Another possibility is to create a queue/device pointer in the bio
structure to hold the original device, and then add a callback in its
backing dev structure to recalculate the limit, but that increases the
size of the bio. Do we need this?

> Regards,
>
> Daniel

--
Evgeniy Polyakov

2007-08-28 17:28:44

by Daniel Phillips

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote:
> On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips
([email protected]) wrote:
> > Say Evgeniy, something I was curious about but forgot to ask you
> > earlier...
> >
> > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > > ...All operations are not atomic, since we do not care about
> > > precise number of bios, but a fact, that we are close or close
> > > enough to the limit.
> > > ... in bio->endio
> > > + q->bio_queued--;
> >
> > In your proposed patch, what prevents the race:
> >
> > cpu1                                cpu2
> >
> > read q->bio_queued
> >                                     q->bio_queued--
> > write q->bio_queued - 1
> > Whoops! We leaked a throttle count.
>
> We do not care about one cpu being able to increase its counter
> higher than the limit, such inaccuracy (maximum bios in flight thus
> can be more than limit, difference is equal to the number of CPUs -
> 1) is a price for removing atomic operation. I thought I pointed it
> in the original description, but might forget, that if it will be an
> issue, that atomic operations can be introduced there. Any
> uber-precise measurements in the case when we are close to the edge
> will not give us any benefit at all, since we are already in the
> grey area.

This is not just inaccurate, it is suicide. Keep leaking throttle
counts and eventually all of them will be gone. No more IO
on that block device!

> Another possibility is to create a queue/device pointer in the bio
> structure to hold original device and then in its backing dev
> structure add a callback to recalculate the limit, but it increases
> the size of the bio. Do we need this?

Different issue. Yes, I think we need a nice simple approach like
that, and prove it is stable before worrying about the size cost.

Regards,

Daniel

2007-08-28 17:55:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([email protected]) wrote:
> > We do not care about one cpu being able to increase its counter
> > higher than the limit, such inaccuracy (maximum bios in flight thus
> > can be more than limit, difference is equal to the number of CPUs -
> > 1) is a price for removing atomic operation. I thought I pointed it
> > in the original description, but might forget, that if it will be an
> > issue, that atomic operations can be introduced there. Any
> > uber-precise measurements in the case when we are close to the edge
> > will not give us any benefit at all, since we are already in the
> > grey area.
>
> This is not just inaccurate, it is suicide. Keep leaking throttle
> counts and eventually all of them will be gone. No more IO
> on that block device!

First, because the numbers of increment and decrement operations are the
same, it will dance around the limit in both directions. Second, I wrote
about this race, and there are a number of ways to deal with it, from
atomic operations to separate counters for in-flight and completed bios
(which can be racy too, but from a different angle). Third, if people
cannot agree even on the much higher-level question of whether the bio
structure should be allowed to grow, how can we discuss details of a
preliminary implementation with known issues?

So I cannot agree that the issue is fatal, but of course it exists,
and it was highlighted.
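
The separate-counters variant could look roughly like this (invented
names; as said, it is still only approximate unless the increments
themselves are made atomic or per-cpu):

/* Illustrative structure, not part of any posted patch. */
struct split_throttle {
        unsigned long submitted;        /* only ever incremented at submission */
        unsigned long completed;        /* only ever incremented at completion */
};

/* The in-flight estimate is the difference; there is no decrement that
 * could be lost, although concurrent increments of the same counter can
 * still collide without atomics. */
static unsigned long inflight_estimate(const struct split_throttle *t)
{
        return t->submitted - t->completed;
}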

Let's solve problems in the order of their appearance. If the bio
structure is allowed to grow, then the whole patch set can be done better;
if not, then there are performance issues (although the more I think about
it, the more I become sure that since a bio itself is very rarely shared,
and thus requires cloning and allocation/freeing, which is itself a much
more costly operation than atomic_sub/dec, it can safely host an
additional atomic operation).

--
Evgeniy Polyakov

2007-08-28 21:08:36

by Daniel Phillips

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([email protected]) wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit, such inaccuracy (maximum bios in flight
> > > thus can be more than limit, difference is equal to the number of
> > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > pointed it in the original description, but might forget, that if
> > > it will be an issue, that atomic operations can be introduced
> > > there. Any uber-precise measurements in the case when we are
> > > close to the edge will not give us any benefit at all, since we
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide. Keep leaking throttle
> > counts and eventually all of them will be gone. No more IO
> > on that block device!
>
> First, because number of increased and decreased operations are the
> same, so it will dance around limit in both directions.

No. Please go and read the description of the race again. A count
gets irretrievably lost because the write operation of the first
decrement is overwritten by the second. Data gets lost. Atomic
operations exist to prevent that sort of thing. You either need to use
them or have a deep understanding of SMP read and write ordering in
order to preserve data integrity by some equivalent algorithm.

> Let's solve problems in order of their appearence. If bio structure
> will be allowed to grow, then the whole patches can be done better.

How about the patch below? It throttles any block driver by
implementing a throttle metric method so that each block driver can
keep track of its own resource consumption in units of its choosing.
As an (important) example, it implements a simple metric for device
mapper devices. Other block devices will work as before, because
they do not define any metric. Short, sweet and untested, which is
why I have not posted it until now.

This patch originally kept its accounting info in backing_dev_info,
however that structure seems to be in some flux and it is just a part of
struct queue anyway, so I lifted the throttle accounting up into
struct queue. We should be able to report on the efficacy of this
patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.000000000 -0700
@@ -3237,6 +3237,15 @@ end_io:
*/
void generic_make_request(struct bio *bio)
{
+ struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+ if (q && q->metric) {
+ int need = bio->bi_reserved = q->metric(bio);
+ bio->queue = q;
+ wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+ atomic_sub(need, &q->available);
+ }
+
if (current->bio_tail) {
/* make_request is active */
*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c 2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/drivers/md/dm.c 2007-08-24 12:14:23.000000000 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
return r;
}

+static unsigned dm_metric(struct bio *bio)
+{
+ return bio->bi_vcnt;
+}
+
/*-----------------------------------------------------------------
* An IDR is used to keep track of allocated minor numbers.
*---------------------------------------------------------------*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
goto bad1_free_minor;

md->queue->queuedata = md;
+ md->queue->metric = dm_metric;
+ atomic_set(&md->queue->available, md->queue->capacity = 1000);
+ init_waitqueue_head(&md->queue->throttle_wait);
+
md->queue->backing_dev_info.congested_fn = dm_any_congested;
md->queue->backing_dev_info.congested_data = md;
blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c 2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/fs/bio.c 2007-08-24 12:10:41.000000000 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
bytes_done = bio->bi_size;
}

- bio->bi_size -= bytes_done;
+ if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+ struct request_queue *q = bio->queue;
+ atomic_add(bio->bi_reserved, &q->available);
+ bio->bi_reserved = 0; /* just in case */
+ wake_up(&q->throttle_wait);
+ }
bio->bi_sector += (bytes_done >> 9);

if (bio->bi_end_io)
--- 2.6.22.clean/include/linux/bio.h 2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/bio.h 2007-08-24 11:53:51.000000000 -0700
@@ -109,6 +109,9 @@ struct bio {
bio_end_io_t *bi_end_io;
atomic_t bi_cnt; /* pin count */

+ struct request_queue *queue; /* for throttling */
+ unsigned bi_reserved; /* throttle metric */
+
void *bi_private;

bio_destructor_t *bi_destructor; /* destructor */
--- 2.6.22.clean/include/linux/blkdev.h 2007-07-08 16:32:17.000000000 -0700
+++ 2.6.22/include/linux/blkdev.h 2007-08-24 12:04:14.000000000 -0700
@@ -395,6 +395,10 @@ struct request_queue
struct work_struct unplug_work;

struct backing_dev_info backing_dev_info;
+ unsigned (*metric)(struct bio *bio); /* bio throttle metric */
+ wait_queue_head_t throttle_wait;
+ atomic_t available;
+ unsigned capacity;

/*
* The queue owner gets to use this for whatever they like.

2007-08-29 08:54:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Tue, Aug 28, 2007 at 02:08:04PM -0700, Daniel Phillips ([email protected]) wrote:
> On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> > On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([email protected]) wrote:
> > > > We do not care about one cpu being able to increase its counter
> > > > higher than the limit, such inaccuracy (maximum bios in flight
> > > > thus can be more than limit, difference is equal to the number of
> > > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > > pointed it in the original description, but might forget, that if
> > > > it will be an issue, that atomic operations can be introduced
> > > > there. Any uber-precise measurements in the case when we are
> > > > close to the edge will not give us any benefit at all, since we
> > > > are already in the grey area.
> > >
> > > This is not just inaccurate, it is suicide. Keep leaking throttle
> > > counts and eventually all of them will be gone. No more IO
> > > on that block device!
> >
> > First, because number of increased and decreased operations are the
> > same, so it will dance around limit in both directions.
>
> No. Please go and read it the description of the race again. A count
> gets irretrievably lost because the write operation of the first
> decrement is overwritten by the second. Data gets lost. Atomic
> operations exist to prevent that sort of thing. You either need to use
> them or have a deep understanding of SMP read and write ordering in
> order to preserve data integrity by some equivalent algorithm.

I think you should complete your emotional email with a description of how
atomic types are operated on and how processors access data, just to give
a lesson to those who never knew how SMP works but create patches and have
the conscience to send them and even discuss them.
Then, if of course you want to, which I doubt, you can reread the previous
mails and find that this race and the possibilities to solve it were
pointed out way too long ago.
Anyway, I prefer to look like I do not know how SMP and atomic operations
work and thus stay away from this discussion.

> --- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.000000000 -0700
> +++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.000000000 -0700
> @@ -3237,6 +3237,15 @@ end_io:
> */
> void generic_make_request(struct bio *bio)
> {
> + struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> +
> + if (q && q->metric) {
> + int need = bio->bi_reserved = q->metric(bio);
> + bio->queue = q;

In case you have a stacked device, this entry will be rewritten and you
will lose all your accounting data.

> + wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
> + atomic_sub(need, &q->available);
> + }

--
Evgeniy Polyakov

2007-08-30 23:21:10

by Daniel Phillips

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> Then, if of course you will want, which I doubt, you can reread
> previous mails and find that it was pointed to that race and
> possibilities to solve it way too long ago.

What still bothers me about your response is that, while you know the
race exists and do not disagree with my example, you don't seem to see
that that race can eventually lock up the block device by repeatedly
losing throttle counts which are never recovered. What prevents that?

> > --- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.000000000 -0700
> > +++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.000000000 -0700
> > @@ -3237,6 +3237,15 @@ end_io:
> > */
> > void generic_make_request(struct bio *bio)
> > {
> > + struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > +
> > + if (q && q->metric) {
> > + int need = bio->bi_reserved = q->metric(bio);
> > + bio->queue = q;
>
> In case you have a stacked device, this entry will be rewritten and you
> will lose all your accounting data.

It is a weakness all right. Well,

- if (q && q->metric) {
+ if (q && q->metric && !bio->queue) {

which fixes that problem. Maybe there is a better fix possible. Thanks
for the catch!

The original conception was that this block throttling would apply only
to the highest level submission of the bio, the one that crosses the
boundary between filesystem (or direct block device application) and
block layer. Resubmitting a bio or submitting a dependent bio from
inside a block driver does not need to be throttled because all
resources required to guarantee completion must have been obtained
_before_ the bio was allowed to proceed into the block layer.

The other principle we are trying to satisfy is that the throttling
should not be released until bio->endio, which I am not completely sure
about with the patch as modified above. Your earlier idea of having
the throttle protection only cover the actual bio submission is
interesting and may be effective in some cases, in fact it may cover
the specific case of ddsnap. But we don't have to look any further
than ddraid (distributed raid) to find a case it doesn't cover - the
additional memory allocated to hold parity data has to be reserved
until parity data is deallocated, long after the submission completes.
So while you manage to avoid some logistical difficulties, it also looks
like you didn't solve the general problem.

Hopefully I will be able to report soon on whether my patch actually
works, when I get back from vacation. The mechanism in ddsnap that this is
supposed to replace is effective; it is just ugly and tricky to verify.

Regards,

Daniel

2007-08-31 17:37:43

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

Hi Daniel.

On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips ([email protected]) wrote:
> On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> > Then, if of course you will want, which I doubt, you can reread
> > previous mails and find that it was pointed to that race and
> > possibilities to solve it way too long ago.
>
> What still bothers me about your response is that, while you know the
> race exists and do not disagree with my example, you don't seem to see
> that that race can eventually lock up the block device by repeatedly
> losing throttle counts which are never recovered. What prevents that?

I posted a trivial hack with its possible errors pointed out and a
question about whether it should be further extended (with the race fixed
by any of the possible methods and so on) or a new one should be developed
(like in your approach, where only the high-level device is charged);
instead I got replies that it contains bugs which will stop the system and
kill the gene pool of mankind. I know how it works and where the problems
are. And if we go with this approach I will fix the pointed-out issues.

> > > --- 2.6.22.clean/block/ll_rw_blk.c 2007-07-08 16:32:17.000000000 -0700
> > > +++ 2.6.22/block/ll_rw_blk.c 2007-08-24 12:07:16.000000000 -0700
> > > @@ -3237,6 +3237,15 @@ end_io:
> > > */
> > > void generic_make_request(struct bio *bio)
> > > {
> > > + struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > > +
> > > + if (q && q->metric) {
> > > + int need = bio->bi_reserved = q->metric(bio);
> > > + bio->queue = q;
> >
> > In case you have a stacked device, this entry will be rewritten and you
> > will lose all your accounting data.
>
> It is a weakness all right. Well,
>
> - if (q && q->metric) {
> + if (q && q->metric && !bio->queue) {
>
> which fixes that problem. Maybe there is a better fix possible. Thanks
> for the catch!

Yes, it should.

> The original conception was that this block throttling would apply only
> to the highest level submission of the bio, the one that crosses the
> boundary between filesystem (or direct block device application) and
> block layer. Resubmitting a bio or submitting a dependent bio from
> inside a block driver does not need to be throttled because all
> resources required to guarantee completion must have been obtained
> _before_ the bio was allowed to proceed into the block layer.

We still have not come to a conclusion, but I do not want to start a
flamewar; you believe that throttling must be done on the top-level
device, so you need to extend the bio and convince others that the idea
is worth it.

> The other principle we are trying to satisfy is that the throttling
> should not be released until bio->endio, which I am not completely sure
> about with the patch as modified above. Your earlier idea of having
> the throttle protection only cover the actual bio submission is
> interesting and may be effective in some cases, in fact it may cover
> the specific case of ddsnap. But we don't have to look any further
> than ddraid (distributed raid) to find a case it doesn't cover - the
> additional memory allocated to hold parity data has to be reserved
> until parity data is deallocated, long after the submission completes.
> So while you manage to avoid some logistical difficulties, it also looks
> like you didn't solve the general problem.

The block layer does not know about, and should not be bothered with, the
nature of the underlying device - if you think that the limit should not
be recharged in the endio callback, then provide your own layer on top of
the bio and call the endio callback only when you think it is ready to be
completed.

> Hopefully I will be able to report on whether my patch actually works
> soon, when I get back from vacation. The mechanism in ddsnap this is
> supposed to replace is effective, it is just ugly and tricky to verify.
>
> Regards,
>
> Daniel

--
Evgeniy Polyakov

2007-08-31 21:43:55

by Alasdair G Kergon

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> Resubmitting a bio or submitting a dependent bio from
> inside a block driver does not need to be throttled because all
> resources required to guarantee completion must have been obtained
> _before_ the bio was allowed to proceed into the block layer.

I'm toying with the idea of keeping track of the maximum device stack
depth for each stacked device, and only permitting it to increase in
controlled circumstances.

Alasdair
--
[email protected]

2007-09-02 04:43:29

by Daniel Phillips

[permalink] [raw]
Subject: Re: [1/1] Block device throttling [Re: Distributed storage.]

On Friday 31 August 2007 14:41, Alasdair G Kergon wrote:
> On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> > Resubmitting a bio or submitting a dependent bio from
> > inside a block driver does not need to be throttled because all
> > resources required to guarantee completion must have been obtained
> > _before_ the bio was allowed to proceed into the block layer.
>
> I'm toying with the idea of keeping track of the maximum device stack
> depth for each stacked device, and only permitting it to increase in
> controlled circumstances.

Hi Alasdair,

What kind of circumstances did you have in mind?

Regards,

Daniel