Hi,
If you are not already tired of so many IO controller implementations, here
is another one.
This is a very early, very crude implementation, posted to get early feedback
on whether this approach makes any sense or not.
This controller is a proportional weight IO controller, primarily
based on/inspired by dm-ioband. One of the things I personally found a little
odd about dm-ioband was the need for a dm-ioband device for every device we
want to control. I thought that we could probably make this control per
request queue and get rid of the device mapper driver. This should make the
configuration aspect easier.
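As a rough illustration of the per request queue idea, something like the
following could hang off each request queue (all structure, field and
function names below are made up for this sketch; this is not the patch code):

#include <linux/list.h>
#include <linux/spinlock.h>

/*
 * Sketch: every request queue keeps its own list of per-cgroup groups,
 * each with a weight configured through the cgroup filesystem.  Service
 * goes to the group that is furthest behind its proportional share.
 */
struct io_group {
        struct list_head list;          /* on io_controller->group_list */
        void *cgroup_id;                /* identifies the owning cgroup */
        unsigned int weight;            /* proportional share of this group */
        unsigned long sectors_serviced; /* service received so far */
};

struct io_controller {
        spinlock_t lock;
        struct list_head group_list;    /* one io_group per active cgroup */
};

/* Pick the group with the smallest weighted service (overflow ignored). */
static struct io_group *io_pick_group(struct io_controller *ioc)
{
        struct io_group *iog, *best = NULL;

        list_for_each_entry(iog, &ioc->group_list, list) {
                if (!best || iog->sectors_serviced * best->weight <
                             best->sectors_serviced * iog->weight)
                        best = iog;
        }
        return best;
}

With something like this per queue, a bio only needs to be classified to its
io_group at submission time, and no extra stacked device is required.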
I have picked up quite a bit of code from dm-ioband, especially for
the biocgroup implementation.
I have done very basic testing, which amounts to running 2-3 dd commands in
different cgroups on x86_64. I wanted to throw out the code early to get
some feedback.
More details about the design and a how-to are in the documentation patch.
Your comments are welcome.
Thanks
Vivek
--
On Thu, 2008-11-06 at 10:30 -0500, [email protected] wrote:
> Hi,
>
> If you are not already tired of so many io controller implementations, here
> is another one.
>
> This is a very eary very crude implementation to get early feedback to see
> if this approach makes any sense or not.
>
> This controller is a proportional weight IO controller primarily
> based on/inspired by dm-ioband. One of the things I personally found little
> odd about dm-ioband was need of a dm-ioband device for every device we want
> to control. I thought that probably we can make this control per request
> queue and get rid of device mapper driver. This should make configuration
> aspect easy.
>
> I have picked up quite some amount of code from dm-ioband especially for
> biocgroup implementation.
>
> I have done very basic testing and that is running 2-3 dd commands in different
> cgroups on x86_64. Wanted to throw out the code early to get some feedback.
>
> More details about the design and how to are in documentation patch.
>
> Your comments are welcome.
please include
QUILT_REFRESH_ARGS="--diffstat --strip-trailing-whitespace"
in your environment or .quiltrc
I would expect all those bio* files to be placed in block/ not mm/
Does this still require I use dm, or does it also work on regular block
devices? Patch 4/4 isn't quite clear on this.
On Thu, Nov 06, 2008 at 04:49:53PM +0100, Peter Zijlstra wrote:
> On Thu, 2008-11-06 at 10:30 -0500, [email protected] wrote:
> > Hi,
> >
> > If you are not already tired of so many io controller implementations, here
> > is another one.
> >
> > This is a very eary very crude implementation to get early feedback to see
> > if this approach makes any sense or not.
> >
> > This controller is a proportional weight IO controller primarily
> > based on/inspired by dm-ioband. One of the things I personally found little
> > odd about dm-ioband was need of a dm-ioband device for every device we want
> > to control. I thought that probably we can make this control per request
> > queue and get rid of device mapper driver. This should make configuration
> > aspect easy.
> >
> > I have picked up quite some amount of code from dm-ioband especially for
> > biocgroup implementation.
> >
> > I have done very basic testing and that is running 2-3 dd commands in different
> > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
> >
> > More details about the design and how to are in documentation patch.
> >
> > Your comments are welcome.
>
> please include
>
> QUILT_REFRESH_ARGS="--diffstat --strip-trailing-whitespace"
>
> in your environment or .quiltrc
>
Sure, I will. I'm a first-time user of quilt. :-)
> I would expect all those bio* files to be placed in block/ not mm/
>
Thinking more about it, block/ is probably the more appropriate place.
I will do that.
> Does this still require I use dm, or does it also work on regular block
> devices? Patch 4/4 isn't quite clear on this.
No, you don't have to use dm. It will simply work on regular devices. We
shall have to add a few lines of code for it to work on devices which don't
make use of the standard __make_request() function and instead provide their
own make_request function.
For example, I have added those few lines of code so that it can work
with dm devices. I shall have to do something similar for md too.
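To illustrate (the io_controller_bio() name and its semantics are assumptions
for this sketch, not the real interface in the patches), a driver that
provides its own make_request function would only need a call like this,
mirroring the hook placed in the standard __make_request() path:

#include <linux/blkdev.h>
#include <linux/bio.h>

/* Assumed controller entry point; returns non-zero if it buffered the bio. */
extern int io_controller_bio(struct request_queue *q, struct bio *bio);

static int my_make_request(struct request_queue *q, struct bio *bio)
{
        /*
         * Let the proportional weight controller account for (and possibly
         * buffer) the bio before the driver maps it.
         */
        if (io_controller_bio(q, bio))
                return 0;       /* controller queued it; resubmitted later */

        /* ... normal remapping/submission of the bio continues here ... */
        return 0;
}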
Though, I am not very sure why I need to do IO control on higher level
devices. Would it be sufficient if we control only the bottom-most
physical block devices?
Anyway, this approach should work at any level.
Thanks
Vivek
On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
> > Does this still require I use dm, or does it also work on regular block
> > devices? Patch 4/4 isn't quite clear on this.
>
> No. You don't have to use dm. It will simply work on regular devices. We
> shall have to put few lines of code for it to work on devices which don't
> make use of standard __make_request() function and provide their own
> make_request function.
>
> Hence for example, I have put that few lines of code so that it can work
> with dm device. I shall have to do something similar for md too.
>
> Though, I am not very sure why do I need to do IO control on higher level
> devices. Will it be sufficient if we just control only bottom most
> physical block devices?
>
> Anyway, this approach should work at any level.
Nice, although I would think only doing the higher level devices makes
more sense than only doing the leaves.
Is there any reason we cannot merge this with the regular io-scheduler
interface? afaik the only problem with doing group scheduling in the
io-schedulers is the stacked devices issue.
Could we make the io-schedulers aware of this hierarchy?
On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
> On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
>
> > > Does this still require I use dm, or does it also work on regular block
> > > devices? Patch 4/4 isn't quite clear on this.
> >
> > No. You don't have to use dm. It will simply work on regular devices. We
> > shall have to put few lines of code for it to work on devices which don't
> > make use of standard __make_request() function and provide their own
> > make_request function.
> >
> > Hence for example, I have put that few lines of code so that it can work
> > with dm device. I shall have to do something similar for md too.
> >
> > Though, I am not very sure why do I need to do IO control on higher level
> > devices. Will it be sufficient if we just control only bottom most
> > physical block devices?
> >
> > Anyway, this approach should work at any level.
>
> Nice, although I would think only doing the higher level devices makes
> more sense than only doing the leafs.
>
I thought that we should be doing any kind of resource management only at
the level where there is actual contention for the resources. In this case,
it looks like only the bottom-most devices are slow and don't have infinite
bandwidth, hence the contention. (I am not taking into account contention at
the bus level or at the interconnect level for external storage, assuming
the interconnect is not the bottleneck.)
For example, let's say there is one linear device mapper device dm-0 on
top of physical devices sda and sdb, and two tasks in two different
cgroups are reading two different files from device dm-0. Now if these
files both fall on the same physical device (either sda or sdb), then they
will be contending for resources. But if the files being read are on different
physical devices, then practically there is no device contention (even if on
the surface it might look like dm-0 is being contended for). So if the
files are on different physical devices, the IO controller will not know it.
It will simply dispatch one group at a time, and the other device might
remain idle.
Keeping that in mind, I thought we would be able to make use of the full
available bandwidth if we do IO control only at the bottom-most devices. Doing
it at a higher layer has the potential of not making use of the full
available bandwidth.
> Is there any reason we cannot merge this with the regular io-scheduler
> interface? afaik the only problem with doing group scheduling in the
> io-schedulers is the stacked devices issue.
I think we should be able to merge it with the regular IO schedulers. Apart
from the stacked device issue, people also mentioned that it is so closely
tied to the IO schedulers that we would end up doing four implementations for
four schedulers, and that is not very good from a maintenance perspective.
But I will spend more time finding out whether there is common ground
between the schedulers so that a lot of common IO control code can be shared
by all of them.
>
> Could we make the io-schedulers aware of this hierarchy?
You mean IO schedulers knowing that there is somebody above them doing
proportional weight dispatching of bios? If yes, how would that help?
Thanks
Vivek
Peter Zijlstra wrote:
> Nice, although I would think only doing the higher level devices makes
> more sense than only doing the leafs.
I'm not convinced.
Say that you have two resource groups on a bunch of LVM
volumes across two disks.
If one of the resource groups only sends requests to one
of the disks, the other resource group should be able to
get all of its requests through immediately at the other
disk.
Holding up the second resource group's requests could
result in a disk being idle. Worse, once that cgroup's
requests finally make it through, the other cgroup might
also want to use the disk and they both get slowed down.
When a resource is uncontended, should a potential user
be made to wait?
--
All rights reversed.
On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
> On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
> > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
> >
> > > > Does this still require I use dm, or does it also work on regular block
> > > > devices? Patch 4/4 isn't quite clear on this.
> > >
> > > No. You don't have to use dm. It will simply work on regular devices. We
> > > shall have to put few lines of code for it to work on devices which don't
> > > make use of standard __make_request() function and provide their own
> > > make_request function.
> > >
> > > Hence for example, I have put that few lines of code so that it can work
> > > with dm device. I shall have to do something similar for md too.
> > >
> > > Though, I am not very sure why do I need to do IO control on higher level
> > > devices. Will it be sufficient if we just control only bottom most
> > > physical block devices?
> > >
> > > Anyway, this approach should work at any level.
> >
> > Nice, although I would think only doing the higher level devices makes
> > more sense than only doing the leafs.
> >
>
> I thought that we should be doing any kind of resource management only at
> the level where there is actual contention for the resources.So in this case
> looks like only bottom most devices are slow and don't have infinite bandwidth
> hence the contention.(I am not taking into account the contention at
> bus level or contention at interconnect level for external storage,
> assuming interconnect is not the bottleneck).
>
> For example, lets say there is one linear device mapper device dm-0 on
> top of physical devices sda and sdb. Assuming two tasks in two different
> cgroups are reading two different files from deivce dm-0. Now if these
> files both fall on same physical device (either sda or sdb), then they
> will be contending for resources. But if files being read are on different
> physical deivces then practically there is no device contention (Even on
> the surface it might look like that dm-0 is being contended for). So if
> files are on different physical devices, IO controller will not know it.
> He will simply dispatch one group at a time and other device might remain
> idle.
>
> Keeping that in mind I thought we will be able to make use of full
> available bandwidth if we do IO control only at bottom most device. Doing
> it at higher layer has potential of not making use of full available bandwidth.
>
> > Is there any reason we cannot merge this with the regular io-scheduler
> > interface? afaik the only problem with doing group scheduling in the
> > io-schedulers is the stacked devices issue.
>
> I think we should be able to merge it with regular io schedulers. Apart
> from stacked device issue, people also mentioned that it is so closely
> tied to IO schedulers that we will end up doing four implementations for
> four schedulers and that is not very good from maintenance perspective.
>
> But I will spend more time in finding out if there is a common ground
> between schedulers so that a lot of common IO control code can be used
> in all the schedulers.
>
> >
> > Could we make the io-schedulers aware of this hierarchy?
>
> You mean IO schedulers knowing that there is somebody above them doing
> proportional weight dispatching of bios? If yes, how would that help?
Well, take the slightly more elaborate example of a raid[56] setup. This
will need to sometimes issue multiple leaf level ios to satisfy one top
level io.
How are you going to attribute this fairly?
I don't think the issue of bandwidth availability like the above will really
be an issue; if your stripe is set up symmetrically, the contention
should average out to both (all) disks in equal measure.
The only real issue I can see is with linear volumes, but those are
stupid anyway - none of the gains but all the risks.
Peter Zijlstra wrote:
> The only real issue I can see is with linear volumes, but those are
> stupid anyway - non of the gains but all the risks.
Linear volumes may well be the most common ones.
People start out with the filesystems at a certain size,
increasing onto a second (new) disk later, when more space
is required.
--
All rights reversed.
On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
> > >
> > > > > Does this still require I use dm, or does it also work on regular block
> > > > > devices? Patch 4/4 isn't quite clear on this.
> > > >
> > > > No. You don't have to use dm. It will simply work on regular devices. We
> > > > shall have to put few lines of code for it to work on devices which don't
> > > > make use of standard __make_request() function and provide their own
> > > > make_request function.
> > > >
> > > > Hence for example, I have put that few lines of code so that it can work
> > > > with dm device. I shall have to do something similar for md too.
> > > >
> > > > Though, I am not very sure why do I need to do IO control on higher level
> > > > devices. Will it be sufficient if we just control only bottom most
> > > > physical block devices?
> > > >
> > > > Anyway, this approach should work at any level.
> > >
> > > Nice, although I would think only doing the higher level devices makes
> > > more sense than only doing the leafs.
> > >
> >
> > I thought that we should be doing any kind of resource management only at
> > the level where there is actual contention for the resources.So in this case
> > looks like only bottom most devices are slow and don't have infinite bandwidth
> > hence the contention.(I am not taking into account the contention at
> > bus level or contention at interconnect level for external storage,
> > assuming interconnect is not the bottleneck).
> >
> > For example, lets say there is one linear device mapper device dm-0 on
> > top of physical devices sda and sdb. Assuming two tasks in two different
> > cgroups are reading two different files from deivce dm-0. Now if these
> > files both fall on same physical device (either sda or sdb), then they
> > will be contending for resources. But if files being read are on different
> > physical deivces then practically there is no device contention (Even on
> > the surface it might look like that dm-0 is being contended for). So if
> > files are on different physical devices, IO controller will not know it.
> > He will simply dispatch one group at a time and other device might remain
> > idle.
> >
> > Keeping that in mind I thought we will be able to make use of full
> > available bandwidth if we do IO control only at bottom most device. Doing
> > it at higher layer has potential of not making use of full available bandwidth.
> >
> > > Is there any reason we cannot merge this with the regular io-scheduler
> > > interface? afaik the only problem with doing group scheduling in the
> > > io-schedulers is the stacked devices issue.
> >
> > I think we should be able to merge it with regular io schedulers. Apart
> > from stacked device issue, people also mentioned that it is so closely
> > tied to IO schedulers that we will end up doing four implementations for
> > four schedulers and that is not very good from maintenance perspective.
> >
> > But I will spend more time in finding out if there is a common ground
> > between schedulers so that a lot of common IO control code can be used
> > in all the schedulers.
> >
> > >
> > > Could we make the io-schedulers aware of this hierarchy?
> >
> > You mean IO schedulers knowing that there is somebody above them doing
> > proportional weight dispatching of bios? If yes, how would that help?
>
> Well, take the slightly more elaborate example or a raid[56] setup. This
> will need to sometimes issue multiple leaf level ios to satisfy one top
> level io.
>
> How are you going to attribute this fairly?
>
I think in this case the definition of fair allocation will be a little
different. We will do fair allocation only at the leaf nodes where
there is actual contention, irrespective of the higher level setup.
So if a higher level block device issues multiple ios to satisfy one top
level io, we will actually do the bandwidth allocation only on
those multiple ios, because that's the real IO contending for disk
bandwidth. And if these multiple ios are going to different physical
devices, then contention management will take place on those devices.
IOW, we will not worry about providing fairness for bios submitted to
higher level devices. We will just pitch in for contention management
only when requests from various cgroups are contending for a physical
device at the bottom-most layer. Isn't that fair?
Thanks
Vivek
> I don't think the issue of bandwidth availability like above will really
> be an issue, if your stripe is set up symmetrically, the contention
> should average out to both (all) disks in equal measures.
>
> The only real issue I can see is with linear volumes, but those are
> stupid anyway - non of the gains but all the risks.
On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote:
> Peter Zijlstra wrote:
>
> > The only real issue I can see is with linear volumes, but those are
> > stupid anyway - non of the gains but all the risks.
>
> Linear volumes may well be the most common ones.
>
> People start out with the filesystems at a certain size,
> increasing onto a second (new) disk later, when more space
> is required.
Are they aware of how risky linear volumes are? I would discourage
anyone from using them.
It seems that approaches with two level scheduling (DM-IOBand or this
patch set on top, with another scheduler at the elevator level) will have the
possibility of undesirable interactions (see the "issues" listed at the
end of the second patch). For example, a request submitted as RT might
get delayed at the higher layers, even if cfq at the elevator level is doing
the right thing.
Moreover, if the requests in the higher level scheduler are dispatched
as soon as they come, there would be no queuing at the higher layers,
unless the request queue at the lower level fills up and causes a
backlog. And in the absence of queuing, any work-conserving scheduler
would behave as a no-op scheduler.
These issues motivate taking a second look at two level scheduling.
The main motivations for two level scheduling seem to be:
(1) Support bandwidth division across multiple devices for RAID and LVMs.
(2) Divide bandwidth between different cgroups without modifying each
of the existing schedulers (and without replicating the code).
One possible approach to handle (1) is to keep track of the bandwidth
utilized by each cgroup in a per-cgroup data structure (instead of a
per-cgroup, per-device data structure) and use that information to make
scheduling decisions within the elevator level schedulers. Such a
patch could be disabled behind a flag if co-ordination across different
device schedulers is not required.
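As a rough sketch of that idea (all names below are assumptions made for
illustration, not code from any posted patch), the per-cgroup structure
would simply accumulate service across every device:

#include <linux/spinlock.h>
#include <linux/types.h>

/*
 * One accounting structure per cgroup, shared by all devices, so an
 * elevator level scheduler can weigh the cgroup's aggregate usage when
 * co-ordination across devices is wanted.
 */
struct io_cgroup_stats {
        spinlock_t lock;
        u64 disk_time_used;     /* disk time summed over all devices */
        unsigned int weight;    /* the cgroup's configured share */
};

/* Called by any device's scheduler after servicing a cgroup's request. */
static void io_cgroup_charge(struct io_cgroup_stats *stats, u64 service)
{
        unsigned long flags;

        spin_lock_irqsave(&stats->lock, flags);
        stats->disk_time_used += service;
        spin_unlock_irqrestore(&stats->lock, flags);
}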
And (2) can probably be handled by having one scheduler support
different modes. For example, one possible mode is "proportional
division between cgroups + no-op between threads of a cgroup" or "cfq
between cgroups + cfq between threads of a cgroup". That would also
help avoid combinations which might not work, e.g. the RT request issue
mentioned earlier in this email. And this unified scheduler can re-use
code from all the existing patches.
Thanks.
--
Nauman
On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal <[email protected]> wrote:
> On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
>> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
>> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
>> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
>> > >
>> > > > > Does this still require I use dm, or does it also work on regular block
>> > > > > devices? Patch 4/4 isn't quite clear on this.
>> > > >
>> > > > No. You don't have to use dm. It will simply work on regular devices. We
>> > > > shall have to put few lines of code for it to work on devices which don't
>> > > > make use of standard __make_request() function and provide their own
>> > > > make_request function.
>> > > >
>> > > > Hence for example, I have put that few lines of code so that it can work
>> > > > with dm device. I shall have to do something similar for md too.
>> > > >
>> > > > Though, I am not very sure why do I need to do IO control on higher level
>> > > > devices. Will it be sufficient if we just control only bottom most
>> > > > physical block devices?
>> > > >
>> > > > Anyway, this approach should work at any level.
>> > >
>> > > Nice, although I would think only doing the higher level devices makes
>> > > more sense than only doing the leafs.
>> > >
>> >
>> > I thought that we should be doing any kind of resource management only at
>> > the level where there is actual contention for the resources.So in this case
>> > looks like only bottom most devices are slow and don't have infinite bandwidth
>> > hence the contention.(I am not taking into account the contention at
>> > bus level or contention at interconnect level for external storage,
>> > assuming interconnect is not the bottleneck).
>> >
>> > For example, lets say there is one linear device mapper device dm-0 on
>> > top of physical devices sda and sdb. Assuming two tasks in two different
>> > cgroups are reading two different files from deivce dm-0. Now if these
>> > files both fall on same physical device (either sda or sdb), then they
>> > will be contending for resources. But if files being read are on different
>> > physical deivces then practically there is no device contention (Even on
>> > the surface it might look like that dm-0 is being contended for). So if
>> > files are on different physical devices, IO controller will not know it.
>> > He will simply dispatch one group at a time and other device might remain
>> > idle.
>> >
>> > Keeping that in mind I thought we will be able to make use of full
>> > available bandwidth if we do IO control only at bottom most device. Doing
>> > it at higher layer has potential of not making use of full available bandwidth.
>> >
>> > > Is there any reason we cannot merge this with the regular io-scheduler
>> > > interface? afaik the only problem with doing group scheduling in the
>> > > io-schedulers is the stacked devices issue.
>> >
>> > I think we should be able to merge it with regular io schedulers. Apart
>> > from stacked device issue, people also mentioned that it is so closely
>> > tied to IO schedulers that we will end up doing four implementations for
>> > four schedulers and that is not very good from maintenance perspective.
>> >
>> > But I will spend more time in finding out if there is a common ground
>> > between schedulers so that a lot of common IO control code can be used
>> > in all the schedulers.
>> >
>> > >
>> > > Could we make the io-schedulers aware of this hierarchy?
>> >
>> > You mean IO schedulers knowing that there is somebody above them doing
>> > proportional weight dispatching of bios? If yes, how would that help?
>>
>> Well, take the slightly more elaborate example or a raid[56] setup. This
>> will need to sometimes issue multiple leaf level ios to satisfy one top
>> level io.
>>
>> How are you going to attribute this fairly?
>>
>
> I think in this case, definition of fair allocation will be little
> different. We will do fair allocation only at the leaf nodes where
> there is actual contention, irrespective of higher level setup.
>
> So if higher level block device issues multiple ios to satisfy one top
> level io, we will actually do the bandwidth allocation only on
> those multiple ios because that's the real IO contending for disk
> bandwidth. And if these multiple ios are going to different physical
> devices, then contention management will take place on those devices.
>
> IOW, we will not worry about providing fairness at bios submitted to
> higher level devices. We will just pitch in for contention management
> only when request from various cgroups are contending for physical
> device at bottom most layers. Isn't if fair?
>
> Thanks
> Vivek
>
>> I don't think the issue of bandwidth availability like above will really
>> be an issue, if your stripe is set up symmetrically, the contention
>> should average out to both (all) disks in equal measures.
>>
>> The only real issue I can see is with linear volumes, but those are
>> stupid anyway - non of the gains but all the risks.
On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote:
> On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote:
> > Peter Zijlstra wrote:
> >
> > > The only real issue I can see is with linear volumes, but those are
> > > stupid anyway - non of the gains but all the risks.
> >
> > Linear volumes may well be the most common ones.
> >
> > People start out with the filesystems at a certain size,
> > increasing onto a second (new) disk later, when more space
> > is required.
>
> Are they aware of how risky linear volumes are? I would discourage
> anyone from using them.
In what way are they risky?
Cheers,
Dave.
--
Dave Chinner
[email protected]
[email protected] wrote:
> Hi,
>
> If you are not already tired of so many io controller implementations, here
> is another one.
>
> This is a very eary very crude implementation to get early feedback to see
> if this approach makes any sense or not.
>
> This controller is a proportional weight IO controller primarily
> based on/inspired by dm-ioband. One of the things I personally found little
> odd about dm-ioband was need of a dm-ioband device for every device we want
> to control. I thought that probably we can make this control per request
> queue and get rid of device mapper driver. This should make configuration
> aspect easy.
>
> I have picked up quite some amount of code from dm-ioband especially for
> biocgroup implementation.
>
> I have done very basic testing and that is running 2-3 dd commands in different
> cgroups on x86_64. Wanted to throw out the code early to get some feedback.
>
> More details about the design and how to are in documentation patch.
>
> Your comments are welcome.
Which kernel version is this patch set based on?
>
> Thanks
> Vivek
>
--
Regards
Gui Jianfeng
On Fri, 2008-11-07 at 11:41 +1100, Dave Chinner wrote:
> On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote:
> > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote:
> > > Peter Zijlstra wrote:
> > >
> > > > The only real issue I can see is with linear volumes, but those are
> > > > stupid anyway - non of the gains but all the risks.
> > >
> > > Linear volumes may well be the most common ones.
> > >
> > > People start out with the filesystems at a certain size,
> > > increasing onto a second (new) disk later, when more space
> > > is required.
> >
> > Are they aware of how risky linear volumes are? I would discourage
> > anyone from using them.
>
> In what way are they risky?
You lose all your data when one disk dies, so your MTBF decreases with
the number of disks in your linear span.
And you get none of the benefits of having multiple disks, like extra
speed from striping, or redundancy from RAID.
Therefore I say that linear volumes are the absolute worst choice.
On Fri, Nov 07, 2008 at 10:36:50AM +0800, Gui Jianfeng wrote:
> [email protected] wrote:
> > Hi,
> >
> > If you are not already tired of so many io controller implementations, here
> > is another one.
> >
> > This is a very eary very crude implementation to get early feedback to see
> > if this approach makes any sense or not.
> >
> > This controller is a proportional weight IO controller primarily
> > based on/inspired by dm-ioband. One of the things I personally found little
> > odd about dm-ioband was need of a dm-ioband device for every device we want
> > to control. I thought that probably we can make this control per request
> > queue and get rid of device mapper driver. This should make configuration
> > aspect easy.
> >
> > I have picked up quite some amount of code from dm-ioband especially for
> > biocgroup implementation.
> >
> > I have done very basic testing and that is running 2-3 dd commands in different
> > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
> >
> > More details about the design and how to are in documentation patch.
> >
> > Your comments are welcome.
>
> Which kernel version is this patch set based on?
>
2.6.27
Thanks
Vivek
On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
> It seems that approaches with two level scheduling (DM-IOBand or this
> patch set on top and another scheduler at elevator) will have the
> possibility of undesirable interactions (see "issues" listed at the
> end of the second patch). For example, a request submitted as RT might
> get delayed at higher layers, even if cfq at elevator level is doing
> the right thing.
>
Yep. Buffering of bios at a higher layer can break the underlying elevator's
assumptions.
What if we start keeping track of task priorities and RT tasks in the higher
level scheduler and dispatch the bios accordingly? Would that break the
underlying noop, deadline or AS?
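For concreteness, here is the kind of thing I have in mind (a sketch only,
under the assumption that the controller can look up the submitting task's
IO priority class; bio_ioprio_class(), io_group_find(), io_group_add_bio()
and io_ctrl_dispatch() are made-up helpers, and struct io_controller is the
assumed per-queue state):

#include <linux/ioprio.h>

static void io_ctrl_queue_bio(struct io_controller *ioc, struct bio *bio)
{
        /* Hand RT bios straight down so the elevator sees them at once. */
        if (bio_ioprio_class(bio) == IOPRIO_CLASS_RT) {
                io_ctrl_dispatch(ioc, bio);
                return;
        }

        /* Everything else is buffered and dispatched in proportional order. */
        io_group_add_bio(io_group_find(ioc, bio), bio);
}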
> Moreover, if the requests in the higher level scheduler are dispatched
> as soon as they come, there would be no queuing at the higher layers,
> unless the request queue at the lower level fills up and causes a
> backlog. And in the absence of queuing, any work-conserving scheduler
> would behave as a no-op scheduler.
>
> These issues motivate to take a second look into two level scheduling.
> The main motivations for two level scheduling seem to be:
> (1) Support bandwidth division across multiple devices for RAID and LVMs.
Nauman, can you give an example where we really need bandwidth division
for higher level devices?
I am beginning to think that the real contention is at the leaf level physical
devices and not at the higher level logical devices, hence we should be doing
any resource management only at the leaf level and not worry about higher
level logical devices.
If this requirement goes away, then the case for a two level scheduler weakens
and one needs to think about doing the changes in the leaf level IO schedulers.
> (2) Divide bandwidth between different cgroups without modifying each
> of the existing schedulers (and without replicating the code).
>
> One possible approach to handle (1) is to keep track of bandwidth
> utilized by each cgroup in a per cgroup data structure (instead of a
> per cgroup per device data structure) and use that information to make
> scheduling decisions within the elevator level schedulers. Such a
> patch can be made flag-disabled if co-ordination across different
> device schedulers is not required.
>
Can you give more details about it? I am not sure I understand it. Exactly
what information should be stored in each cgroup?
I think per-cgroup, per-device data structures are good, so that a scheduler
will not worry about other devices present in the system and will just try
to arbitrate between the various cgroups contending for that device. This goes
back to the same issue of dropping requirement (1) from the IO controller.
> And (2) can probably be handled by having one scheduler support
> different modes. For example, one possible mode is "propotional
> division between crgroups + no-op between threads of a cgroup" or "cfq
> between cgroups + cfq between threads of a cgroup". That would also
> help avoid combinations which might not work e.g RT request issue
> mentioned earlier in this email. And this unified scheduler can re-use
> code from all the existing patches.
>
IIUC, you are suggesting some kind of unification between the four IO
schedulers so that the proportional weight code is not replicated and the
user can switch modes on the fly based on tunables?
Thanks
Vivek
> Thanks.
> --
> Nauman
>
> On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
> >> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
> >> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
> >> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
> >> > >
> >> > > > > Does this still require I use dm, or does it also work on regular block
> >> > > > > devices? Patch 4/4 isn't quite clear on this.
> >> > > >
> >> > > > No. You don't have to use dm. It will simply work on regular devices. We
> >> > > > shall have to put few lines of code for it to work on devices which don't
> >> > > > make use of standard __make_request() function and provide their own
> >> > > > make_request function.
> >> > > >
> >> > > > Hence for example, I have put that few lines of code so that it can work
> >> > > > with dm device. I shall have to do something similar for md too.
> >> > > >
> >> > > > Though, I am not very sure why do I need to do IO control on higher level
> >> > > > devices. Will it be sufficient if we just control only bottom most
> >> > > > physical block devices?
> >> > > >
> >> > > > Anyway, this approach should work at any level.
> >> > >
> >> > > Nice, although I would think only doing the higher level devices makes
> >> > > more sense than only doing the leafs.
> >> > >
> >> >
> >> > I thought that we should be doing any kind of resource management only at
> >> > the level where there is actual contention for the resources.So in this case
> >> > looks like only bottom most devices are slow and don't have infinite bandwidth
> >> > hence the contention.(I am not taking into account the contention at
> >> > bus level or contention at interconnect level for external storage,
> >> > assuming interconnect is not the bottleneck).
> >> >
> >> > For example, lets say there is one linear device mapper device dm-0 on
> >> > top of physical devices sda and sdb. Assuming two tasks in two different
> >> > cgroups are reading two different files from deivce dm-0. Now if these
> >> > files both fall on same physical device (either sda or sdb), then they
> >> > will be contending for resources. But if files being read are on different
> >> > physical deivces then practically there is no device contention (Even on
> >> > the surface it might look like that dm-0 is being contended for). So if
> >> > files are on different physical devices, IO controller will not know it.
> >> > He will simply dispatch one group at a time and other device might remain
> >> > idle.
> >> >
> >> > Keeping that in mind I thought we will be able to make use of full
> >> > available bandwidth if we do IO control only at bottom most device. Doing
> >> > it at higher layer has potential of not making use of full available bandwidth.
> >> >
> >> > > Is there any reason we cannot merge this with the regular io-scheduler
> >> > > interface? afaik the only problem with doing group scheduling in the
> >> > > io-schedulers is the stacked devices issue.
> >> >
> >> > I think we should be able to merge it with regular io schedulers. Apart
> >> > from stacked device issue, people also mentioned that it is so closely
> >> > tied to IO schedulers that we will end up doing four implementations for
> >> > four schedulers and that is not very good from maintenance perspective.
> >> >
> >> > But I will spend more time in finding out if there is a common ground
> >> > between schedulers so that a lot of common IO control code can be used
> >> > in all the schedulers.
> >> >
> >> > >
> >> > > Could we make the io-schedulers aware of this hierarchy?
> >> >
> >> > You mean IO schedulers knowing that there is somebody above them doing
> >> > proportional weight dispatching of bios? If yes, how would that help?
> >>
> >> Well, take the slightly more elaborate example or a raid[56] setup. This
> >> will need to sometimes issue multiple leaf level ios to satisfy one top
> >> level io.
> >>
> >> How are you going to attribute this fairly?
> >>
> >
> > I think in this case, definition of fair allocation will be little
> > different. We will do fair allocation only at the leaf nodes where
> > there is actual contention, irrespective of higher level setup.
> >
> > So if higher level block device issues multiple ios to satisfy one top
> > level io, we will actually do the bandwidth allocation only on
> > those multiple ios because that's the real IO contending for disk
> > bandwidth. And if these multiple ios are going to different physical
> > devices, then contention management will take place on those devices.
> >
> > IOW, we will not worry about providing fairness at bios submitted to
> > higher level devices. We will just pitch in for contention management
> > only when request from various cgroups are contending for physical
> > device at bottom most layers. Isn't if fair?
> >
> > Thanks
> > Vivek
> >
> >> I don't think the issue of bandwidth availability like above will really
> >> be an issue, if your stripe is set up symmetrically, the contention
> >> should average out to both (all) disks in equal measures.
> >>
> >> The only real issue I can see is with linear volumes, but those are
> >> stupid anyway - non of the gains but all the risks.
On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <[email protected]> wrote:
> On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
>> It seems that approaches with two level scheduling (DM-IOBand or this
>> patch set on top and another scheduler at elevator) will have the
>> possibility of undesirable interactions (see "issues" listed at the
>> end of the second patch). For example, a request submitted as RT might
>> get delayed at higher layers, even if cfq at elevator level is doing
>> the right thing.
>>
>
> Yep. Buffering of bios at higher layer can break underlying elevator's
> assumptions.
>
> What if we start keeping track of task priorities and RT tasks in higher
> level schedulers and dispatch the bios accordingly. Will it break the
> underlying noop, deadline or AS?
It probably will not. But then we have a cfq-like scheduler at the higher
level, and we can agree that the combinations "cfq (higher
level)-noop (lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
would probably work. But if we implement one cfq-like
scheduler at the higher level, we would not take care of somebody who
wants noop-noop or proportional-noop. The point I am trying to make is
that there is probably no single one-size-fits-all solution for a
higher level scheduler, and we should limit the arbitrary mixing and
matching of higher level schedulers and elevator schedulers. That
being said, the existence of a higher level scheduler is still a point
of debate, I guess; see my comments below.
>
>> Moreover, if the requests in the higher level scheduler are dispatched
>> as soon as they come, there would be no queuing at the higher layers,
>> unless the request queue at the lower level fills up and causes a
>> backlog. And in the absence of queuing, any work-conserving scheduler
>> would behave as a no-op scheduler.
>>
>> These issues motivate to take a second look into two level scheduling.
>> The main motivations for two level scheduling seem to be:
>> (1) Support bandwidth division across multiple devices for RAID and LVMs.
>
> Nauman, can you give an example where we really need bandwidth division
> for higher level devices.
>
> I am beginning to think that real contention is at leaf level physical
> devices and not at higher level logical devices hence we should be doing
> any resource management only at leaf level and not worry about higher
> level logical devices.
>
> If this requirement goes away, then case of two level scheduler weakens
> and one needs to think about doing changes at leaf level IO schedulers.
I cannot agree with you more that the only contention is at
the leaf level physical devices and bandwidth should be managed only
there. But having seen earlier posts on this list, I feel some folks
might not agree with us. For example, if we have RAID-0 striping, we
might want to schedule requests based on the cumulative bandwidth used
over all devices. Again, I myself don't agree with moving scheduling
to a higher level just to support that.
>
>> (2) Divide bandwidth between different cgroups without modifying each
>> of the existing schedulers (and without replicating the code).
>>
>> One possible approach to handle (1) is to keep track of bandwidth
>> utilized by each cgroup in a per cgroup data structure (instead of a
>> per cgroup per device data structure) and use that information to make
>> scheduling decisions within the elevator level schedulers. Such a
>> patch can be made flag-disabled if co-ordination across different
>> device schedulers is not required.
>>
>
> Can you give more details about it. I am not sure I understand it. Exactly
> what information should be stored in each cgroup.
>
> I think per cgroup per device data structures are good so that an scheduer
> will not worry about other devices present in the system and will just try
> to arbitrate between various cgroup contending for that device. This goes
> back to same issue of getting rid of requirement (1) from io controller.
I was thinking that we could keep track of the disk time used at each
device and keep the cumulative number in a per-cgroup data structure.
But that is only needed if we want to support bandwidth division across
devices. You and I both agree that we probably do not need to do
that.
>
>> And (2) can probably be handled by having one scheduler support
>> different modes. For example, one possible mode is "propotional
>> division between crgroups + no-op between threads of a cgroup" or "cfq
>> between cgroups + cfq between threads of a cgroup". That would also
>> help avoid combinations which might not work e.g RT request issue
>> mentioned earlier in this email. And this unified scheduler can re-use
>> code from all the existing patches.
>>
>
> IIUC, you are suggesting some kind of unification between four IO
> schedulers so that proportional weight code is not replicated and user can
> switch mode on the fly based on tunables?
Yes, that seems to be a solution to avoid replication of code. But we
should also look at any other solutions that avoid replication of
code, and also avoid scheduling in two different layers.
In my opinion, scheduling at two different layers is problematic because
(a) Any buffering done at a higher level will be artificial, unless
the queues at lower levels are completely full. And if there is no
buffering at a higher level, any scheduling scheme would be
ineffective.
(b) We cannot have an arbitrary mixing and matching of higher and
lower level schedulers.
(a) would exist in any solution in which requests are queued at
multiple levels. Can you please comment on this with respect to the
patch that you have posted?
Thanks.
--
Nauman
>
> Thanks
> Vivek
>
>> Thanks.
>> --
>> Nauman
>>
>> On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal <[email protected]> wrote:
>> > On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
>> >> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
>> >> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
>> >> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
>> >> > >
>> >> > > > > Does this still require I use dm, or does it also work on regular block
>> >> > > > > devices? Patch 4/4 isn't quite clear on this.
>> >> > > >
>> >> > > > No. You don't have to use dm. It will simply work on regular devices. We
>> >> > > > shall have to put few lines of code for it to work on devices which don't
>> >> > > > make use of standard __make_request() function and provide their own
>> >> > > > make_request function.
>> >> > > >
>> >> > > > Hence for example, I have put that few lines of code so that it can work
>> >> > > > with dm device. I shall have to do something similar for md too.
>> >> > > >
>> >> > > > Though, I am not very sure why do I need to do IO control on higher level
>> >> > > > devices. Will it be sufficient if we just control only bottom most
>> >> > > > physical block devices?
>> >> > > >
>> >> > > > Anyway, this approach should work at any level.
>> >> > >
>> >> > > Nice, although I would think only doing the higher level devices makes
>> >> > > more sense than only doing the leafs.
>> >> > >
>> >> >
>> >> > I thought that we should be doing any kind of resource management only at
>> >> > the level where there is actual contention for the resources.So in this case
>> >> > looks like only bottom most devices are slow and don't have infinite bandwidth
>> >> > hence the contention.(I am not taking into account the contention at
>> >> > bus level or contention at interconnect level for external storage,
>> >> > assuming interconnect is not the bottleneck).
>> >> >
>> >> > For example, lets say there is one linear device mapper device dm-0 on
>> >> > top of physical devices sda and sdb. Assuming two tasks in two different
>> >> > cgroups are reading two different files from deivce dm-0. Now if these
>> >> > files both fall on same physical device (either sda or sdb), then they
>> >> > will be contending for resources. But if files being read are on different
>> >> > physical deivces then practically there is no device contention (Even on
>> >> > the surface it might look like that dm-0 is being contended for). So if
>> >> > files are on different physical devices, IO controller will not know it.
>> >> > He will simply dispatch one group at a time and other device might remain
>> >> > idle.
>> >> >
>> >> > Keeping that in mind I thought we will be able to make use of full
>> >> > available bandwidth if we do IO control only at bottom most device. Doing
>> >> > it at higher layer has potential of not making use of full available bandwidth.
>> >> >
>> >> > > Is there any reason we cannot merge this with the regular io-scheduler
>> >> > > interface? afaik the only problem with doing group scheduling in the
>> >> > > io-schedulers is the stacked devices issue.
>> >> >
>> >> > I think we should be able to merge it with regular io schedulers. Apart
>> >> > from stacked device issue, people also mentioned that it is so closely
>> >> > tied to IO schedulers that we will end up doing four implementations for
>> >> > four schedulers and that is not very good from maintenance perspective.
>> >> >
>> >> > But I will spend more time in finding out if there is a common ground
>> >> > between schedulers so that a lot of common IO control code can be used
>> >> > in all the schedulers.
>> >> >
>> >> > >
>> >> > > Could we make the io-schedulers aware of this hierarchy?
>> >> >
>> >> > You mean IO schedulers knowing that there is somebody above them doing
>> >> > proportional weight dispatching of bios? If yes, how would that help?
>> >>
>> >> Well, take the slightly more elaborate example or a raid[56] setup. This
>> >> will need to sometimes issue multiple leaf level ios to satisfy one top
>> >> level io.
>> >>
>> >> How are you going to attribute this fairly?
>> >>
>> >
>> > I think in this case, definition of fair allocation will be little
>> > different. We will do fair allocation only at the leaf nodes where
>> > there is actual contention, irrespective of higher level setup.
>> >
>> > So if higher level block device issues multiple ios to satisfy one top
>> > level io, we will actually do the bandwidth allocation only on
>> > those multiple ios because that's the real IO contending for disk
>> > bandwidth. And if these multiple ios are going to different physical
>> > devices, then contention management will take place on those devices.
>> >
>> > IOW, we will not worry about providing fairness at bios submitted to
>> > higher level devices. We will just pitch in for contention management
>> > only when request from various cgroups are contending for physical
>> > device at bottom most layers. Isn't if fair?
>> >
>> > Thanks
>> > Vivek
>> >
>> >> I don't think the issue of bandwidth availability like above will really
>> >> be an issue, if your stripe is set up symmetrically, the contention
>> >> should average out to both (all) disks in equal measures.
>> >>
>> >> The only real issue I can see is with linear volumes, but those are
>> >> stupid anyway - non of the gains but all the risks.
>
On Fri, Nov 07, 2008 at 11:31:44AM +0100, Peter Zijlstra wrote:
> On Fri, 2008-11-07 at 11:41 +1100, Dave Chinner wrote:
> > On Thu, Nov 06, 2008 at 06:11:27PM +0100, Peter Zijlstra wrote:
> > > On Thu, 2008-11-06 at 11:57 -0500, Rik van Riel wrote:
> > > > Peter Zijlstra wrote:
> > > >
> > > > > The only real issue I can see is with linear volumes, but
> > > > > those are stupid anyway - non of the gains but all the
> > > > > risks.
> > > >
> > > > Linear volumes may well be the most common ones.
> > > >
> > > > People start out with the filesystems at a certain size,
> > > > increasing onto a second (new) disk later, when more space
> > > > is required.
> > >
> > > Are they aware of how risky linear volumes are? I would
> > > discourage anyone from using them.
> >
> > In what way are they risky?
>
> You loose all your data when one disk dies, so your mtbf decreases
> with the number of disks in your linear span. And you get non of
> the benefits from having multiple disks, like extra speed from
> striping, or redundancy from raid.
Fmeh. Step back and think for a moment. How does every major
distro build redundant root drives?
Yeah, they build a mirror and then put LVM on top of the mirror
to partition it. Each partition is a *linear volume*, but
no single disk failure is going to lose data because it's
been put on top of a mirror.
IOWs, reliability of linear volumes is only an issue if you don't
build redundancy into your storage stack. Just like RAID0, a single
disk failure will lose data. So, most people use linear volumes on
top of RAID1 or RAID5 to avoid such a single disk failure problem.
People do the same thing with RAID0 - it's what RAID10 and RAID50
do....
Also, linear volume performance scalability is on a different axis
to striping. Striping improves bandwidth, but each disk in a stripe
tends to make the same head movements. Hence striping improves
sequential throughput but only provides limited iops scalability.
Effectively, striping only improves throughput while the disks are
not seeking a lot. Add a few parallel I/O streams, and a stripe will
start to slow down as each disk seeks between streams. i.e. disks
in stripes cannot be considered to be able to operate independently.
Linear volumes create independent regions within the address space -
the regions can seek independently when under concurrent I/O and
hence iops scalability is much greater. Aggregate bandwidth is the
same as striping, it's just that a single stream is limited in
throughput. If you want to improve single stream throughput,
you stripe before you concatenate.
That's why people create layered storage systems like this:
linear volume
|->stripe
|-> md RAID5
|-> disk
|-> disk
|-> disk
|-> disk
|-> disk
|-> md RAID5
|-> disk
|-> disk
|-> disk
|-> disk
|-> disk
|->stripe
|-> md RAID5
......
|->stripe
......
What you then need is a filesystem that can spread the load over
such a layout. Let's use, for argument's sake, XFS and tell it the
geometry of the RAID5 LUNs that make up the volume so that its
allocation is all nicely aligned. Then we match the allocation
group size to the size of each independent part of the linear
volume. Now when XFS spreads its inodes and data over multiple
AGs, it's spreading the load across disks that can operate
concurrently....
Effectively, linear volumes are about as dangerous as striping.
If you don't build in redundancy at a level below the linear
volume or stripe, then you lose when something fails.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
> >> It seems that approaches with two level scheduling (DM-IOBand or this
> >> patch set on top and another scheduler at elevator) will have the
> >> possibility of undesirable interactions (see "issues" listed at the
> >> end of the second patch). For example, a request submitted as RT might
> >> get delayed at higher layers, even if cfq at elevator level is doing
> >> the right thing.
> >>
> >
> > Yep. Buffering of bios at higher layer can break underlying elevator's
> > assumptions.
> >
> > What if we start keeping track of task priorities and RT tasks in higher
> > level schedulers and dispatch the bios accordingly. Will it break the
> > underlying noop, deadline or AS?
>
> It will probably not. But then we have a cfq-like scheduler at higher
> level and we can agree that the combinations "cfq(higher
> level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
> would probably work. But if we implement one high level cfq-like
> scheduler at a higher level, we would not take care of somebody who
> wants noop-noop or propotional-noop. The point I am trying to make is
> that there is probably no single one-size-fits-all solution for a
> higher level scheduler. And we should limit the arbitrary mixing and
> matching of higher level schedulers and elevator schedulers. That
> being said, the existence of a higher level scheduler is still a point
> of debate I guess, see my comments below.
>
Yeah, implementing a CFQ-like thing in the higher level scheduler will make
things complex.
>
> >
> >> Moreover, if the requests in the higher level scheduler are dispatched
> >> as soon as they come, there would be no queuing at the higher layers,
> >> unless the request queue at the lower level fills up and causes a
> >> backlog. And in the absence of queuing, any work-conserving scheduler
> >> would behave as a no-op scheduler.
> >>
> >> These issues motivate to take a second look into two level scheduling.
> >> The main motivations for two level scheduling seem to be:
> >> (1) Support bandwidth division across multiple devices for RAID and LVMs.
> >
> > Nauman, can you give an example where we really need bandwidth division
> > for higher level devices.
> >
> > I am beginning to think that real contention is at leaf level physical
> > devices and not at higher level logical devices hence we should be doing
> > any resource management only at leaf level and not worry about higher
> > level logical devices.
> >
> > If this requirement goes away, then case of two level scheduler weakens
> > and one needs to think about doing changes at leaf level IO schedulers.
>
> I cannot agree with you more on this that there is only contention at
> the leaf level physical devices and bandwidth should be managed only
> there. But having seen earlier posts on this list, i feel some folks
> might not agree with us. For example, if we have RAID-0 striping, we
> might want to schedule requests based on accumulative bandwidth used
> over all devices. Again, I myself don't agree with moving scheduling
> at a higher level just to support that.
>
Hmm.., I am not very convinced that we need to do resource management
at the RAID0 device. The common case of resource management is that a
higher priority task group is not deprived of resources because of a
lower priority task group. So if there is no contention between two task
groups (at the leaf node), then I might as well give them full access to
the RAID 0 logical device without any control.
I hope people who have a requirement for control at higher level devices
can pitch in now and share their perspective.
> >
> >> (2) Divide bandwidth between different cgroups without modifying each
> >> of the existing schedulers (and without replicating the code).
> >>
> >> One possible approach to handle (1) is to keep track of bandwidth
> >> utilized by each cgroup in a per cgroup data structure (instead of a
> >> per cgroup per device data structure) and use that information to make
> >> scheduling decisions within the elevator level schedulers. Such a
> >> patch can be made flag-disabled if co-ordination across different
> >> device schedulers is not required.
> >>
> >
> > Can you give more details about it. I am not sure I understand it. Exactly
> > what information should be stored in each cgroup.
> >
> > I think per cgroup per device data structures are good so that an scheduer
> > will not worry about other devices present in the system and will just try
> > to arbitrate between various cgroup contending for that device. This goes
> > back to same issue of getting rid of requirement (1) from io controller.
>
> I was thinking that we can keep track of disk time used at each
> device, and keep the cumulative number in a per cgroup data structure.
> But that is only if we want to support bandwidth division across
> devices. You and me both agree that we probably do not need to do
> that.
>
> >
> >> And (2) can probably be handled by having one scheduler support
> >> different modes. For example, one possible mode is "propotional
> >> division between crgroups + no-op between threads of a cgroup" or "cfq
> >> between cgroups + cfq between threads of a cgroup". That would also
> >> help avoid combinations which might not work e.g RT request issue
> >> mentioned earlier in this email. And this unified scheduler can re-use
> >> code from all the existing patches.
> >>
> >
> > IIUC, you are suggesting some kind of unification between four IO
> > schedulers so that proportional weight code is not replicated and user can
> > switch mode on the fly based on tunables?
>
> Yes, that seems to be a solution to avoid replication of code. But we
> should also look at any other solutions that avoid replication of
> code, and also avoid scheduling in two different layers.
> In my opinion, scheduling at two different layers is problematic because
> (a) Any buffering done at a higher level will be artificial, unless
> the queues at lower levels are completely full. And if there is no
> buffering at a higher level, any scheduling scheme would be
> ineffective.
> (b) We cannot have an arbitrary mixing and matching of higher and
> lower level schedulers.
>
> (a) would exist in any solution in which requests are queued at
> multiple levels. Can you please comment on this with respect to the
> patch that you have posted?
>
I am not very sure about the question, but in my patch, buffering at
the higher layer is irrespective of the status of the underlying queue.
We try our best to fill the underlying queue with requests, subject only
to the criteria of proportional bandwidth.
So, say there are two cgroups A and B, and we allocate each cgroup 2000
tokens to begin with. If A has consumed all its tokens soon and B
has not, then we will stop A from dispatching more requests and wait for
B to either issue more IO and consume its tokens or get out of contention.
This can leave the disk idle for some time. We can probably do some
optimizations here.
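In rough, standalone C (names and numbers are made up for illustration;
this is not the actual patch code), the per-cgroup token check boils down
to something like:

    #include <stdbool.h>

    struct io_cgroup {
            long tokens;    /* tokens left in the current period */
            long weight;    /* configured share of this cgroup */
    };

    /* May this cgroup dispatch a bio of the given cost right now? */
    static bool may_dispatch(struct io_cgroup *grp, long cost)
    {
            return grp->tokens >= cost;
    }

    /* Charge the cgroup when a bio is actually dispatched. */
    static void charge(struct io_cgroup *grp, long cost)
    {
            grp->tokens -= cost;
    }

    /* When the contending cgroups have used up their tokens, start a
     * new period and hand out fresh tokens in proportion to weight. */
    static void refill(struct io_cgroup **grps, int n, long period_tokens)
    {
            long total = 0;
            int i;

            for (i = 0; i < n; i++)
                    total += grps[i]->weight;
            for (i = 0; i < n; i++)
                    grps[i]->tokens = period_tokens * grps[i]->weight / total;
    }

The idleness problem above is exactly the case where may_dispatch() keeps
returning false for A while B is not issuing any IO.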
Thanks
Vivek
> >>
> >> On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal <[email protected]> wrote:
> >> > On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
> >> >> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
> >> >> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
> >> >> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
> >> >> > >
> >> >> > > > > Does this still require I use dm, or does it also work on regular block
> >> >> > > > > devices? Patch 4/4 isn't quite clear on this.
> >> >> > > >
> >> >> > > > No. You don't have to use dm. It will simply work on regular devices. We
> >> >> > > > shall have to put few lines of code for it to work on devices which don't
> >> >> > > > make use of standard __make_request() function and provide their own
> >> >> > > > make_request function.
> >> >> > > >
> >> >> > > > Hence for example, I have put that few lines of code so that it can work
> >> >> > > > with dm device. I shall have to do something similar for md too.
> >> >> > > >
> >> >> > > > Though, I am not very sure why do I need to do IO control on higher level
> >> >> > > > devices. Will it be sufficient if we just control only bottom most
> >> >> > > > physical block devices?
> >> >> > > >
> >> >> > > > Anyway, this approach should work at any level.
> >> >> > >
> >> >> > > Nice, although I would think only doing the higher level devices makes
> >> >> > > more sense than only doing the leafs.
> >> >> > >
> >> >> >
> >> >> > I thought that we should be doing any kind of resource management only at
> >> >> > the level where there is actual contention for the resources.So in this case
> >> >> > looks like only bottom most devices are slow and don't have infinite bandwidth
> >> >> > hence the contention.(I am not taking into account the contention at
> >> >> > bus level or contention at interconnect level for external storage,
> >> >> > assuming interconnect is not the bottleneck).
> >> >> >
> >> >> > For example, lets say there is one linear device mapper device dm-0 on
> >> >> > top of physical devices sda and sdb. Assuming two tasks in two different
> >> >> > cgroups are reading two different files from deivce dm-0. Now if these
> >> >> > files both fall on same physical device (either sda or sdb), then they
> >> >> > will be contending for resources. But if files being read are on different
> >> >> > physical deivces then practically there is no device contention (Even on
> >> >> > the surface it might look like that dm-0 is being contended for). So if
> >> >> > files are on different physical devices, IO controller will not know it.
> >> >> > He will simply dispatch one group at a time and other device might remain
> >> >> > idle.
> >> >> >
> >> >> > Keeping that in mind I thought we will be able to make use of full
> >> >> > available bandwidth if we do IO control only at bottom most device. Doing
> >> >> > it at higher layer has potential of not making use of full available bandwidth.
> >> >> >
> >> >> > > Is there any reason we cannot merge this with the regular io-scheduler
> >> >> > > interface? afaik the only problem with doing group scheduling in the
> >> >> > > io-schedulers is the stacked devices issue.
> >> >> >
> >> >> > I think we should be able to merge it with regular io schedulers. Apart
> >> >> > from stacked device issue, people also mentioned that it is so closely
> >> >> > tied to IO schedulers that we will end up doing four implementations for
> >> >> > four schedulers and that is not very good from maintenance perspective.
> >> >> >
> >> >> > But I will spend more time in finding out if there is a common ground
> >> >> > between schedulers so that a lot of common IO control code can be used
> >> >> > in all the schedulers.
> >> >> >
> >> >> > >
> >> >> > > Could we make the io-schedulers aware of this hierarchy?
> >> >> >
> >> >> > You mean IO schedulers knowing that there is somebody above them doing
> >> >> > proportional weight dispatching of bios? If yes, how would that help?
> >> >>
> >> >> Well, take the slightly more elaborate example or a raid[56] setup. This
> >> >> will need to sometimes issue multiple leaf level ios to satisfy one top
> >> >> level io.
> >> >>
> >> >> How are you going to attribute this fairly?
> >> >>
> >> >
> >> > I think in this case, definition of fair allocation will be little
> >> > different. We will do fair allocation only at the leaf nodes where
> >> > there is actual contention, irrespective of higher level setup.
> >> >
> >> > So if higher level block device issues multiple ios to satisfy one top
> >> > level io, we will actually do the bandwidth allocation only on
> >> > those multiple ios because that's the real IO contending for disk
> >> > bandwidth. And if these multiple ios are going to different physical
> >> > devices, then contention management will take place on those devices.
> >> >
> >> > IOW, we will not worry about providing fairness at bios submitted to
> >> > higher level devices. We will just pitch in for contention management
> >> > only when request from various cgroups are contending for physical
> >> > device at bottom most layers. Isn't if fair?
> >> >
> >> > Thanks
> >> > Vivek
> >> >
> >> >> I don't think the issue of bandwidth availability like above will really
> >> >> be an issue, if your stripe is set up symmetrically, the contention
> >> >> should average out to both (all) disks in equal measures.
> >> >>
> >> >> The only real issue I can see is with linear volumes, but those are
> >> >> stupid anyway - non of the gains but all the risks.
> >> >
> >
On Mon, Nov 10, 2008 at 6:11 AM, Vivek Goyal <[email protected]> wrote:
> On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
>> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <[email protected]> wrote:
>> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
>> >> It seems that approaches with two level scheduling (DM-IOBand or this
>> >> patch set on top and another scheduler at elevator) will have the
>> >> possibility of undesirable interactions (see "issues" listed at the
>> >> end of the second patch). For example, a request submitted as RT might
>> >> get delayed at higher layers, even if cfq at elevator level is doing
>> >> the right thing.
>> >>
>> >
>> > Yep. Buffering of bios at higher layer can break underlying elevator's
>> > assumptions.
>> >
>> > What if we start keeping track of task priorities and RT tasks in higher
>> > level schedulers and dispatch the bios accordingly. Will it break the
>> > underlying noop, deadline or AS?
>>
>> It will probably not. But then we have a cfq-like scheduler at higher
>> level and we can agree that the combinations "cfq(higher
>> level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
>> would probably work. But if we implement one high level cfq-like
>> scheduler at a higher level, we would not take care of somebody who
>> wants noop-noop or propotional-noop. The point I am trying to make is
>> that there is probably no single one-size-fits-all solution for a
>> higher level scheduler. And we should limit the arbitrary mixing and
>> matching of higher level schedulers and elevator schedulers. That
>> being said, the existence of a higher level scheduler is still a point
>> of debate I guess, see my comments below.
>>
>
> Ya, implemeting CFQ like thing in higher level scheduler will make things
> complex.
>
>>
>> >
>> >> Moreover, if the requests in the higher level scheduler are dispatched
>> >> as soon as they come, there would be no queuing at the higher layers,
>> >> unless the request queue at the lower level fills up and causes a
>> >> backlog. And in the absence of queuing, any work-conserving scheduler
>> >> would behave as a no-op scheduler.
>> >>
>> >> These issues motivate to take a second look into two level scheduling.
>> >> The main motivations for two level scheduling seem to be:
>> >> (1) Support bandwidth division across multiple devices for RAID and LVMs.
>> >
>> > Nauman, can you give an example where we really need bandwidth division
>> > for higher level devices.
>> >
>> > I am beginning to think that real contention is at leaf level physical
>> > devices and not at higher level logical devices hence we should be doing
>> > any resource management only at leaf level and not worry about higher
>> > level logical devices.
>> >
>> > If this requirement goes away, then case of two level scheduler weakens
>> > and one needs to think about doing changes at leaf level IO schedulers.
>>
>> I cannot agree with you more on this that there is only contention at
>> the leaf level physical devices and bandwidth should be managed only
>> there. But having seen earlier posts on this list, i feel some folks
>> might not agree with us. For example, if we have RAID-0 striping, we
>> might want to schedule requests based on accumulative bandwidth used
>> over all devices. Again, I myself don't agree with moving scheduling
>> at a higher level just to support that.
>>
>
> Hmm.., I am not very convinced that we need to do resource management
> at RAID0 device. The common case of resource management is that a higher
> priority task group is not deprived of resources because of lower priority
> task group. So if there is no contention between two task groups (At leaf
> node), then I might as well let them give them full access to RAID 0
> logical device without any control.
>
> Hope people who have requirement of control at higher level devices can
> pitch in now and share their perspective.
>
>> >
>> >> (2) Divide bandwidth between different cgroups without modifying each
>> >> of the existing schedulers (and without replicating the code).
>> >>
>> >> One possible approach to handle (1) is to keep track of bandwidth
>> >> utilized by each cgroup in a per cgroup data structure (instead of a
>> >> per cgroup per device data structure) and use that information to make
>> >> scheduling decisions within the elevator level schedulers. Such a
>> >> patch can be made flag-disabled if co-ordination across different
>> >> device schedulers is not required.
>> >>
>> >
>> > Can you give more details about it. I am not sure I understand it. Exactly
>> > what information should be stored in each cgroup.
>> >
>> > I think per cgroup per device data structures are good so that an scheduer
>> > will not worry about other devices present in the system and will just try
>> > to arbitrate between various cgroup contending for that device. This goes
>> > back to same issue of getting rid of requirement (1) from io controller.
>>
>> I was thinking that we can keep track of disk time used at each
>> device, and keep the cumulative number in a per cgroup data structure.
>> But that is only if we want to support bandwidth division across
>> devices. You and me both agree that we probably do not need to do
>> that.
>>
>> >
>> >> And (2) can probably be handled by having one scheduler support
>> >> different modes. For example, one possible mode is "propotional
>> >> division between crgroups + no-op between threads of a cgroup" or "cfq
>> >> between cgroups + cfq between threads of a cgroup". That would also
>> >> help avoid combinations which might not work e.g RT request issue
>> >> mentioned earlier in this email. And this unified scheduler can re-use
>> >> code from all the existing patches.
>> >>
>> >
>> > IIUC, you are suggesting some kind of unification between four IO
>> > schedulers so that proportional weight code is not replicated and user can
>> > switch mode on the fly based on tunables?
>>
>> Yes, that seems to be a solution to avoid replication of code. But we
>> should also look at any other solutions that avoid replication of
>> code, and also avoid scheduling in two different layers.
>> In my opinion, scheduling at two different layers is problematic because
>> (a) Any buffering done at a higher level will be artificial, unless
>> the queues at lower levels are completely full. And if there is no
>> buffering at a higher level, any scheduling scheme would be
>> ineffective.
>> (b) We cannot have an arbitrary mixing and matching of higher and
>> lower level schedulers.
>>
>> (a) would exist in any solution in which requests are queued at
>> multiple levels. Can you please comment on this with respect to the
>> patch that you have posted?
>>
>
> I am not very sure about the queustion, but in my patch, buffering at
> at higher layer is irrespective of the status of underlying queue. We
> try our best to fill underlying queue with request, only subject to the
> criteria of proportional bandwidth.
>
> So, if there are two cgroups A and B and we allocate two cgroups 2000
> tokens each to begin with. If A has consumed all the tokens soon and B
> has not, then we will stop A from dispatching more requests and wait for
> B to either issue more IO and consume tokens or get out of contention.
> This can leave disk idle for sometime. We can probably do some
> optimizations here.
What do you think about elevator-based solutions like the 2-level cfq
patches submitted by Satoshi and Vasily earlier? CFQ can be trivially
modified to do proportional division (i.e. give time slices in
proportion to weight instead of priority). Such a solution would
avoid the idleness problem like the one you mentioned above and can also
avoid burstiness issues (see the smoothing patches -- v1.2.0 and v1.3.0 --
of dm-ioband) in token-based schemes.
Also, doing time-based token allocation (as you mentioned in the TODO
list) sounds very interesting. Can we look at the disk time taken by each
bio and use that to account for tokens? The problem is that the time
taken is not available when the requests are sent to the disk, but we can
do delayed token charging (i.e. deduct tokens after the request is
completed). It seems that such an approach should work. What do you
think?
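Just to make the delayed-charging idea concrete, here is a tiny
self-contained sketch (all names are hypothetical, not taken from any
existing patch):

    #include <stdint.h>

    struct io_cgroup {
            int64_t tokens;            /* remaining budget, in nanoseconds */
    };

    struct bio_info {
            struct io_cgroup *owner;   /* cgroup this bio is attributed to */
            uint64_t dispatch_ns;      /* timestamp when sent to the disk */
    };

    /* Nothing is deducted at dispatch time; we only remember when the
     * request left for the device. */
    static void on_dispatch(struct bio_info *b, uint64_t now_ns)
    {
            b->dispatch_ns = now_ns;
    }

    /* On completion the service time is known, so charge it then. */
    static void on_completion(struct bio_info *b, uint64_t now_ns)
    {
            b->owner->tokens -= (int64_t)(now_ns - b->dispatch_ns);
    }

One caveat with this accounting is that the measured interval also
includes time the request spent queued behind other cgroups' requests at
the device, so some smoothing or averaging would probably still be needed.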
>
> Thanks
> Vivek
>
>> >>
>> >> On Thu, Nov 6, 2008 at 9:08 AM, Vivek Goyal <[email protected]> wrote:
>> >> > On Thu, Nov 06, 2008 at 05:52:07PM +0100, Peter Zijlstra wrote:
>> >> >> On Thu, 2008-11-06 at 11:39 -0500, Vivek Goyal wrote:
>> >> >> > On Thu, Nov 06, 2008 at 05:16:13PM +0100, Peter Zijlstra wrote:
>> >> >> > > On Thu, 2008-11-06 at 11:01 -0500, Vivek Goyal wrote:
>> >> >> > >
>> >> >> > > > > Does this still require I use dm, or does it also work on regular block
>> >> >> > > > > devices? Patch 4/4 isn't quite clear on this.
>> >> >> > > >
>> >> >> > > > No. You don't have to use dm. It will simply work on regular devices. We
>> >> >> > > > shall have to put few lines of code for it to work on devices which don't
>> >> >> > > > make use of standard __make_request() function and provide their own
>> >> >> > > > make_request function.
>> >> >> > > >
>> >> >> > > > Hence for example, I have put that few lines of code so that it can work
>> >> >> > > > with dm device. I shall have to do something similar for md too.
>> >> >> > > >
>> >> >> > > > Though, I am not very sure why do I need to do IO control on higher level
>> >> >> > > > devices. Will it be sufficient if we just control only bottom most
>> >> >> > > > physical block devices?
>> >> >> > > >
>> >> >> > > > Anyway, this approach should work at any level.
>> >> >> > >
>> >> >> > > Nice, although I would think only doing the higher level devices makes
>> >> >> > > more sense than only doing the leafs.
>> >> >> > >
>> >> >> >
>> >> >> > I thought that we should be doing any kind of resource management only at
>> >> >> > the level where there is actual contention for the resources.So in this case
>> >> >> > looks like only bottom most devices are slow and don't have infinite bandwidth
>> >> >> > hence the contention.(I am not taking into account the contention at
>> >> >> > bus level or contention at interconnect level for external storage,
>> >> >> > assuming interconnect is not the bottleneck).
>> >> >> >
>> >> >> > For example, lets say there is one linear device mapper device dm-0 on
>> >> >> > top of physical devices sda and sdb. Assuming two tasks in two different
>> >> >> > cgroups are reading two different files from deivce dm-0. Now if these
>> >> >> > files both fall on same physical device (either sda or sdb), then they
>> >> >> > will be contending for resources. But if files being read are on different
>> >> >> > physical deivces then practically there is no device contention (Even on
>> >> >> > the surface it might look like that dm-0 is being contended for). So if
>> >> >> > files are on different physical devices, IO controller will not know it.
>> >> >> > He will simply dispatch one group at a time and other device might remain
>> >> >> > idle.
>> >> >> >
>> >> >> > Keeping that in mind I thought we will be able to make use of full
>> >> >> > available bandwidth if we do IO control only at bottom most device. Doing
>> >> >> > it at higher layer has potential of not making use of full available bandwidth.
>> >> >> >
>> >> >> > > Is there any reason we cannot merge this with the regular io-scheduler
>> >> >> > > interface? afaik the only problem with doing group scheduling in the
>> >> >> > > io-schedulers is the stacked devices issue.
>> >> >> >
>> >> >> > I think we should be able to merge it with regular io schedulers. Apart
>> >> >> > from stacked device issue, people also mentioned that it is so closely
>> >> >> > tied to IO schedulers that we will end up doing four implementations for
>> >> >> > four schedulers and that is not very good from maintenance perspective.
>> >> >> >
>> >> >> > But I will spend more time in finding out if there is a common ground
>> >> >> > between schedulers so that a lot of common IO control code can be used
>> >> >> > in all the schedulers.
>> >> >> >
>> >> >> > >
>> >> >> > > Could we make the io-schedulers aware of this hierarchy?
>> >> >> >
>> >> >> > You mean IO schedulers knowing that there is somebody above them doing
>> >> >> > proportional weight dispatching of bios? If yes, how would that help?
>> >> >>
>> >> >> Well, take the slightly more elaborate example or a raid[56] setup. This
>> >> >> will need to sometimes issue multiple leaf level ios to satisfy one top
>> >> >> level io.
>> >> >>
>> >> >> How are you going to attribute this fairly?
>> >> >>
>> >> >
>> >> > I think in this case, definition of fair allocation will be little
>> >> > different. We will do fair allocation only at the leaf nodes where
>> >> > there is actual contention, irrespective of higher level setup.
>> >> >
>> >> > So if higher level block device issues multiple ios to satisfy one top
>> >> > level io, we will actually do the bandwidth allocation only on
>> >> > those multiple ios because that's the real IO contending for disk
>> >> > bandwidth. And if these multiple ios are going to different physical
>> >> > devices, then contention management will take place on those devices.
>> >> >
>> >> > IOW, we will not worry about providing fairness at bios submitted to
>> >> > higher level devices. We will just pitch in for contention management
>> >> > only when request from various cgroups are contending for physical
>> >> > device at bottom most layers. Isn't if fair?
>> >> >
>> >> > Thanks
>> >> > Vivek
>> >> >
>> >> >> I don't think the issue of bandwidth availability like above will really
>> >> >> be an issue, if your stripe is set up symmetrically, the contention
>> >> >> should average out to both (all) disks in equal measures.
>> >> >>
>> >> >> The only real issue I can see is with linear volumes, but those are
>> >> >> stupid anyway - non of the gains but all the risks.
>> >> >
>> >
>
On Tue, Nov 11, 2008 at 11:55:53AM -0800, Nauman Rafique wrote:
> On Mon, Nov 10, 2008 at 6:11 AM, Vivek Goyal <[email protected]> wrote:
> > On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
> >> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <[email protected]> wrote:
> >> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
> >> >> It seems that approaches with two level scheduling (DM-IOBand or this
> >> >> patch set on top and another scheduler at elevator) will have the
> >> >> possibility of undesirable interactions (see "issues" listed at the
> >> >> end of the second patch). For example, a request submitted as RT might
> >> >> get delayed at higher layers, even if cfq at elevator level is doing
> >> >> the right thing.
> >> >>
> >> >
> >> > Yep. Buffering of bios at higher layer can break underlying elevator's
> >> > assumptions.
> >> >
> >> > What if we start keeping track of task priorities and RT tasks in higher
> >> > level schedulers and dispatch the bios accordingly. Will it break the
> >> > underlying noop, deadline or AS?
> >>
> >> It will probably not. But then we have a cfq-like scheduler at higher
> >> level and we can agree that the combinations "cfq(higher
> >> level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
> >> would probably work. But if we implement one high level cfq-like
> >> scheduler at a higher level, we would not take care of somebody who
> >> wants noop-noop or propotional-noop. The point I am trying to make is
> >> that there is probably no single one-size-fits-all solution for a
> >> higher level scheduler. And we should limit the arbitrary mixing and
> >> matching of higher level schedulers and elevator schedulers. That
> >> being said, the existence of a higher level scheduler is still a point
> >> of debate I guess, see my comments below.
> >>
> >
> > Ya, implemeting CFQ like thing in higher level scheduler will make things
> > complex.
> >
> >>
> >> >
> >> >> Moreover, if the requests in the higher level scheduler are dispatched
> >> >> as soon as they come, there would be no queuing at the higher layers,
> >> >> unless the request queue at the lower level fills up and causes a
> >> >> backlog. And in the absence of queuing, any work-conserving scheduler
> >> >> would behave as a no-op scheduler.
> >> >>
> >> >> These issues motivate to take a second look into two level scheduling.
> >> >> The main motivations for two level scheduling seem to be:
> >> >> (1) Support bandwidth division across multiple devices for RAID and LVMs.
> >> >
> >> > Nauman, can you give an example where we really need bandwidth division
> >> > for higher level devices.
> >> >
> >> > I am beginning to think that real contention is at leaf level physical
> >> > devices and not at higher level logical devices hence we should be doing
> >> > any resource management only at leaf level and not worry about higher
> >> > level logical devices.
> >> >
> >> > If this requirement goes away, then case of two level scheduler weakens
> >> > and one needs to think about doing changes at leaf level IO schedulers.
> >>
> >> I cannot agree with you more on this that there is only contention at
> >> the leaf level physical devices and bandwidth should be managed only
> >> there. But having seen earlier posts on this list, i feel some folks
> >> might not agree with us. For example, if we have RAID-0 striping, we
> >> might want to schedule requests based on accumulative bandwidth used
> >> over all devices. Again, I myself don't agree with moving scheduling
> >> at a higher level just to support that.
> >>
> >
> > Hmm.., I am not very convinced that we need to do resource management
> > at RAID0 device. The common case of resource management is that a higher
> > priority task group is not deprived of resources because of lower priority
> > task group. So if there is no contention between two task groups (At leaf
> > node), then I might as well let them give them full access to RAID 0
> > logical device without any control.
> >
> > Hope people who have requirement of control at higher level devices can
> > pitch in now and share their perspective.
> >
> >> >
> >> >> (2) Divide bandwidth between different cgroups without modifying each
> >> >> of the existing schedulers (and without replicating the code).
> >> >>
> >> >> One possible approach to handle (1) is to keep track of bandwidth
> >> >> utilized by each cgroup in a per cgroup data structure (instead of a
> >> >> per cgroup per device data structure) and use that information to make
> >> >> scheduling decisions within the elevator level schedulers. Such a
> >> >> patch can be made flag-disabled if co-ordination across different
> >> >> device schedulers is not required.
> >> >>
> >> >
> >> > Can you give more details about it. I am not sure I understand it. Exactly
> >> > what information should be stored in each cgroup.
> >> >
> >> > I think per cgroup per device data structures are good so that an scheduer
> >> > will not worry about other devices present in the system and will just try
> >> > to arbitrate between various cgroup contending for that device. This goes
> >> > back to same issue of getting rid of requirement (1) from io controller.
> >>
> >> I was thinking that we can keep track of disk time used at each
> >> device, and keep the cumulative number in a per cgroup data structure.
> >> But that is only if we want to support bandwidth division across
> >> devices. You and me both agree that we probably do not need to do
> >> that.
> >>
> >> >
> >> >> And (2) can probably be handled by having one scheduler support
> >> >> different modes. For example, one possible mode is "propotional
> >> >> division between crgroups + no-op between threads of a cgroup" or "cfq
> >> >> between cgroups + cfq between threads of a cgroup". That would also
> >> >> help avoid combinations which might not work e.g RT request issue
> >> >> mentioned earlier in this email. And this unified scheduler can re-use
> >> >> code from all the existing patches.
> >> >>
> >> >
> >> > IIUC, you are suggesting some kind of unification between four IO
> >> > schedulers so that proportional weight code is not replicated and user can
> >> > switch mode on the fly based on tunables?
> >>
> >> Yes, that seems to be a solution to avoid replication of code. But we
> >> should also look at any other solutions that avoid replication of
> >> code, and also avoid scheduling in two different layers.
> >> In my opinion, scheduling at two different layers is problematic because
> >> (a) Any buffering done at a higher level will be artificial, unless
> >> the queues at lower levels are completely full. And if there is no
> >> buffering at a higher level, any scheduling scheme would be
> >> ineffective.
> >> (b) We cannot have an arbitrary mixing and matching of higher and
> >> lower level schedulers.
> >>
> >> (a) would exist in any solution in which requests are queued at
> >> multiple levels. Can you please comment on this with respect to the
> >> patch that you have posted?
> >>
> >
> > I am not very sure about the queustion, but in my patch, buffering at
> > at higher layer is irrespective of the status of underlying queue. We
> > try our best to fill underlying queue with request, only subject to the
> > criteria of proportional bandwidth.
> >
> > So, if there are two cgroups A and B and we allocate two cgroups 2000
> > tokens each to begin with. If A has consumed all the tokens soon and B
> > has not, then we will stop A from dispatching more requests and wait for
> > B to either issue more IO and consume tokens or get out of contention.
> > This can leave disk idle for sometime. We can probably do some
> > optimizations here.
>
> What do you think about elevator based solutions like 2 level cfq
> patches submitted by Satoshi and Vasily earlier?
I have had a very high level look at Satoshi's patch. I will go into
details soon. I was thinking that this patch solves the problem only
for CFQ. Can we create a common layer which can be shared by all
four IO schedulers?
This one common layer can take care of all the management w.r.t.
per-device per-cgroup data structures, track all the groups and their
limits (either a token-based or a time-based scheme), and control the
dispatch of requests.
This way we can enable the IO controller not only for CFQ but for all the
IO schedulers without duplicating too much code.
This is what I am playing around with currently. At this point I am
not sure how much common ground I can have between all the IO
schedulers.
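To give an idea of the kind of common layer I have in mind (purely a
structural sketch; every name here is invented and nothing below is
existing code):

    /* One node per (cgroup, request queue) pair, owned by the common
     * layer; an individual IO scheduler never sees other devices. */
    struct io_group {
            unsigned int weight;       /* share configured via the cgroup */
            long budget;               /* tokens or time left this period */
            struct io_group *parent;   /* for a hierarchy, NULL at the top */
    };

    /* Per request-queue state kept by the common layer. */
    struct io_queue_data {
            struct io_group **groups;  /* cgroups active on this device */
            int nr_groups;
    };

    /* Hook a scheduler (noop, deadline, AS or CFQ) would call before
     * dispatching a request owned by grp. */
    static int iocommon_may_dispatch(struct io_queue_data *qd,
                                     struct io_group *grp)
    {
            (void)qd;                  /* placeholder; a real policy would
                                        * look at the other groups too */
            return grp->budget > 0;
    }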
> CFQ can be trivially
> modified to do proportional division (i.e give time slices in
> proportion to weight instead of priority).
> And such a solution would
> avoid idleness problem like the one you mentioned above.
Can you just elaborate a little on how you get around the idleness
problem? If you don't create idleness, then if two tasks in two cgroups
are doing sequential IO, they might simply get into lockstep and we will
not achieve any differentiated service proportional to their weights.
> and can also
> avoid burstiness issues (see smoothing patches -- v1.2.0 and v1.3.0 --
> of dm-ioband) in token based schemes.
>
> Also doing time based token allocation (as you mentioned in TODO list)
> sounds very interesting. Can we look at the disk time taken by each
> bio and use that to account for tokens? The problem is that the time
> taken is not available when the requests are sent to disk, but we can
> do delayed token charging (i.e deduct tokens after the request is
> completed?). It seems that such an approach should work. What do you
> think?
This is a good idea. Charging the cgroup based on the time actually
consumed should be doable. I will look into it. I think in the past
somebody asked how you account for the seek time incurred because of the
switchover between cgroups. Maybe an average time per cgroup can help
here a bit.
This is more about refining the dispatch algorithm once we have agreed
upon the other semantics, like the 2-level scheduler and whether we can
come up with a common layer which can be shared by all four IO
schedulers. Once a common layer is possible, we can always change the
common layer algorithm from token based to time based to achieve better
accuracy.
Thanks
Vivek
On Tue, Nov 11, 2008 at 2:30 PM, Vivek Goyal <[email protected]> wrote:
> On Tue, Nov 11, 2008 at 11:55:53AM -0800, Nauman Rafique wrote:
>> On Mon, Nov 10, 2008 at 6:11 AM, Vivek Goyal <[email protected]> wrote:
>> > On Fri, Nov 07, 2008 at 01:36:20PM -0800, Nauman Rafique wrote:
>> >> On Fri, Nov 7, 2008 at 6:19 AM, Vivek Goyal <[email protected]> wrote:
>> >> > On Thu, Nov 06, 2008 at 03:07:57PM -0800, Nauman Rafique wrote:
>> >> >> It seems that approaches with two level scheduling (DM-IOBand or this
>> >> >> patch set on top and another scheduler at elevator) will have the
>> >> >> possibility of undesirable interactions (see "issues" listed at the
>> >> >> end of the second patch). For example, a request submitted as RT might
>> >> >> get delayed at higher layers, even if cfq at elevator level is doing
>> >> >> the right thing.
>> >> >>
>> >> >
>> >> > Yep. Buffering of bios at higher layer can break underlying elevator's
>> >> > assumptions.
>> >> >
>> >> > What if we start keeping track of task priorities and RT tasks in higher
>> >> > level schedulers and dispatch the bios accordingly. Will it break the
>> >> > underlying noop, deadline or AS?
>> >>
>> >> It will probably not. But then we have a cfq-like scheduler at higher
>> >> level and we can agree that the combinations "cfq(higher
>> >> level)-noop(lower level)", "cfq-deadline", "cfq-as" and "cfq-cfq"
>> >> would probably work. But if we implement one high level cfq-like
>> >> scheduler at a higher level, we would not take care of somebody who
>> >> wants noop-noop or propotional-noop. The point I am trying to make is
>> >> that there is probably no single one-size-fits-all solution for a
>> >> higher level scheduler. And we should limit the arbitrary mixing and
>> >> matching of higher level schedulers and elevator schedulers. That
>> >> being said, the existence of a higher level scheduler is still a point
>> >> of debate I guess, see my comments below.
>> >>
>> >
>> > Ya, implemeting CFQ like thing in higher level scheduler will make things
>> > complex.
>> >
>> >>
>> >> >
>> >> >> Moreover, if the requests in the higher level scheduler are dispatched
>> >> >> as soon as they come, there would be no queuing at the higher layers,
>> >> >> unless the request queue at the lower level fills up and causes a
>> >> >> backlog. And in the absence of queuing, any work-conserving scheduler
>> >> >> would behave as a no-op scheduler.
>> >> >>
>> >> >> These issues motivate to take a second look into two level scheduling.
>> >> >> The main motivations for two level scheduling seem to be:
>> >> >> (1) Support bandwidth division across multiple devices for RAID and LVMs.
>> >> >
>> >> > Nauman, can you give an example where we really need bandwidth division
>> >> > for higher level devices.
>> >> >
>> >> > I am beginning to think that real contention is at leaf level physical
>> >> > devices and not at higher level logical devices hence we should be doing
>> >> > any resource management only at leaf level and not worry about higher
>> >> > level logical devices.
>> >> >
>> >> > If this requirement goes away, then case of two level scheduler weakens
>> >> > and one needs to think about doing changes at leaf level IO schedulers.
>> >>
>> >> I cannot agree with you more on this that there is only contention at
>> >> the leaf level physical devices and bandwidth should be managed only
>> >> there. But having seen earlier posts on this list, i feel some folks
>> >> might not agree with us. For example, if we have RAID-0 striping, we
>> >> might want to schedule requests based on accumulative bandwidth used
>> >> over all devices. Again, I myself don't agree with moving scheduling
>> >> at a higher level just to support that.
>> >>
>> >
>> > Hmm.., I am not very convinced that we need to do resource management
>> > at RAID0 device. The common case of resource management is that a higher
>> > priority task group is not deprived of resources because of lower priority
>> > task group. So if there is no contention between two task groups (At leaf
>> > node), then I might as well let them give them full access to RAID 0
>> > logical device without any control.
>> >
>> > Hope people who have requirement of control at higher level devices can
>> > pitch in now and share their perspective.
>> >
>> >> >
>> >> >> (2) Divide bandwidth between different cgroups without modifying each
>> >> >> of the existing schedulers (and without replicating the code).
>> >> >>
>> >> >> One possible approach to handle (1) is to keep track of bandwidth
>> >> >> utilized by each cgroup in a per cgroup data structure (instead of a
>> >> >> per cgroup per device data structure) and use that information to make
>> >> >> scheduling decisions within the elevator level schedulers. Such a
>> >> >> patch can be made flag-disabled if co-ordination across different
>> >> >> device schedulers is not required.
>> >> >>
>> >> >
>> >> > Can you give more details about it. I am not sure I understand it. Exactly
>> >> > what information should be stored in each cgroup.
>> >> >
>> >> > I think per cgroup per device data structures are good so that an scheduer
>> >> > will not worry about other devices present in the system and will just try
>> >> > to arbitrate between various cgroup contending for that device. This goes
>> >> > back to same issue of getting rid of requirement (1) from io controller.
>> >>
>> >> I was thinking that we can keep track of disk time used at each
>> >> device, and keep the cumulative number in a per cgroup data structure.
>> >> But that is only if we want to support bandwidth division across
>> >> devices. You and me both agree that we probably do not need to do
>> >> that.
>> >>
>> >> >
>> >> >> And (2) can probably be handled by having one scheduler support
>> >> >> different modes. For example, one possible mode is "propotional
>> >> >> division between crgroups + no-op between threads of a cgroup" or "cfq
>> >> >> between cgroups + cfq between threads of a cgroup". That would also
>> >> >> help avoid combinations which might not work e.g RT request issue
>> >> >> mentioned earlier in this email. And this unified scheduler can re-use
>> >> >> code from all the existing patches.
>> >> >>
>> >> >
>> >> > IIUC, you are suggesting some kind of unification between four IO
>> >> > schedulers so that proportional weight code is not replicated and user can
>> >> > switch mode on the fly based on tunables?
>> >>
>> >> Yes, that seems to be a solution to avoid replication of code. But we
>> >> should also look at any other solutions that avoid replication of
>> >> code, and also avoid scheduling in two different layers.
>> >> In my opinion, scheduling at two different layers is problematic because
>> >> (a) Any buffering done at a higher level will be artificial, unless
>> >> the queues at lower levels are completely full. And if there is no
>> >> buffering at a higher level, any scheduling scheme would be
>> >> ineffective.
>> >> (b) We cannot have an arbitrary mixing and matching of higher and
>> >> lower level schedulers.
>> >>
>> >> (a) would exist in any solution in which requests are queued at
>> >> multiple levels. Can you please comment on this with respect to the
>> >> patch that you have posted?
>> >>
>> >
>> > I am not very sure about the queustion, but in my patch, buffering at
>> > at higher layer is irrespective of the status of underlying queue. We
>> > try our best to fill underlying queue with request, only subject to the
>> > criteria of proportional bandwidth.
>> >
>> > So, if there are two cgroups A and B and we allocate two cgroups 2000
>> > tokens each to begin with. If A has consumed all the tokens soon and B
>> > has not, then we will stop A from dispatching more requests and wait for
>> > B to either issue more IO and consume tokens or get out of contention.
>> > This can leave disk idle for sometime. We can probably do some
>> > optimizations here.
>>
>> What do you think about elevator based solutions like 2 level cfq
>> patches submitted by Satoshi and Vasily earlier?
>
> I have had a very high level look at Satoshi's patch. I will go into
> details soon. I was thinking that this patch solves the problem only
> for CFQ. Can we create a common layer which can be shared by all
> the four IO schedulers.
>
> So this one common layer can take care of all the management w.r.t
> per device per cgroup data structures and track all the groups, their
> limits (either token based or time based scheme), and control the
> dispatch of requests.
>
> This way we can enable IO controller not only for CFQ but for all the
> IO schedulers without duplicating too much of code.
>
> This is what I am playing around with currently. At this point I am
> not sure, how much of common ground I can have between all the IO
> schedulers.
I see your point. But having some common code in different schedulers
is not worse than what we have today (cfq, as, and deadline all have
some common code). Besides, each lower level (elevator level)
scheduler might impose certain requirements on higher level schedulers
(e.g. the RT requests for cfq that we talked about earlier).
>
>> CFQ can be trivially
>> modified to do proportional division (i.e give time slices in
>> proportion to weight instead of priority).
>> And such a solution would
>> avoid idleness problem like the one you mentioned above.
>
> Can you just elaborate a little on how do you get around idleness problem?
> If you don't create idleness than if two tasks in two cgroups are doing
> sequential IO, they might simply get into lockstep and we will not achieve
> any differentiated service proportionate to their weight.
I was thinking of a more cfq-like solution for proportional division
at the elevator level (i.e. not a token-based solution). There are two
options for proportional bandwidth division at the elevator level: 1)
change the size of the time slice in proportion to the weights, or 2)
allocate equal time slices each time but allocate more slices to the
cgroup with more weight. For (2), we can actually keep track of the time
taken to serve requests and allocate time slices in such a way that the
actual disk time is proportional to the weight. We can adopt a
fair-queuing-like approach (http://lkml.org/lkml/2008/4/1/234) for this
if we want to go that way.
I am not sure whether the solutions mentioned above will have the
lockstep problem you mentioned or not. Since we are allocating time
slices, and would have anticipation built in (just like cfq), we would
have some level of idleness. But this idleness can be predicted based
on a thread's behavior. Can we use an AS-like algorithm for predicting
idle time before starting a new epoch in your token-based patch?
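To illustrate the two options (a sketch only; the helper names and the
base-slice numbers are made up):

    /* Option 1: scale the slice length by the cgroup's weight.
     * With base_slice = 100 ms and default_weight = 100, a cgroup of
     * weight 200 gets a 200 ms slice. */
    static unsigned int slice_for_group(unsigned int base_slice,
                                        unsigned int weight,
                                        unsigned int default_weight)
    {
            return base_slice * weight / default_weight;
    }

    /* Option 2: keep the slice length fixed, but give a group with
     * twice the weight twice as many turns per scheduling round. */
    static unsigned int slices_per_round(unsigned int weight,
                                         unsigned int default_weight)
    {
            return (weight + default_weight - 1) / default_weight;
    }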
>
>> and can also
>> avoid burstiness issues (see smoothing patches -- v1.2.0 and v1.3.0 --
>> of dm-ioband) in token based schemes.
>>
>> Also doing time based token allocation (as you mentioned in TODO list)
>> sounds very interesting. Can we look at the disk time taken by each
>> bio and use that to account for tokens? The problem is that the time
>> taken is not available when the requests are sent to disk, but we can
>> do delayed token charging (i.e deduct tokens after the request is
>> completed?). It seems that such an approach should work. What do you
>> think?
>
> This is a good idea. Charging the cgroup based on time actually consumed
> should be doable. I will look into it. I think in the past somebody
> mentioned that how do you account for the seek time taken because of
> switchover between cgroups? May be average time per cgroup can help here
> a bit.
>
> This is more about refining the dispatch algorightm once we have agreed
> upon other semantics like 2 level scheduler and can we come up with a common
> layer which can be shared by all four IO schedulers. Once common layer is
> possible, we can always change the common layer algorithm from token based
> to time based to achive better accuracy.
>
> Thanks
> Vivek
>
Hi,
From: [email protected]
Subject: [patch 0/4] [RFC] Another proportional weight IO controller
Date: Thu, 06 Nov 2008 10:30:22 -0500
> Hi,
>
> If you are not already tired of so many io controller implementations, here
> is another one.
>
> This is a very eary very crude implementation to get early feedback to see
> if this approach makes any sense or not.
>
> This controller is a proportional weight IO controller primarily
> based on/inspired by dm-ioband. One of the things I personally found little
> odd about dm-ioband was need of a dm-ioband device for every device we want
> to control. I thought that probably we can make this control per request
> queue and get rid of device mapper driver. This should make configuration
> aspect easy.
>
> I have picked up quite some amount of code from dm-ioband especially for
> biocgroup implementation.
>
> I have done very basic testing and that is running 2-3 dd commands in different
> cgroups on x86_64. Wanted to throw out the code early to get some feedback.
>
> More details about the design and how to are in documentation patch.
>
> Your comments are welcome.
Do you have any benchmark results?
I'm especially interested in the following:
- Comparison of disk performance with and without the I/O controller patch.
- Uneven I/O loads: processes which belong to a cgroup that is given a
  smaller weight than another cgroup put a heavier I/O load, like the
  following.
echo 1024 > /cgroup/bio/test1/bio.shares
echo 8192 > /cgroup/bio/test2/bio.shares
echo $$ > /cgroup/bio/test1/tasks
dd if=/somefile1-1 of=/dev/null &
dd if=/somefile1-2 of=/dev/null &
...
dd if=/somefile1-100 of=/dev/null
echo $$ > /cgroup/bio/test2/tasks
dd if=/somefile2-1 of=/dev/null &
dd if=/somefile2-2 of=/dev/null &
...
dd if=/somefile2-10 of=/dev/null &
Thanks,
Ryo Tsuruta
Hi,
> From: Nauman Rafique <[email protected]>
> Date: Wed, Nov 12, 2008 01:20:13PM -0800
>
...
> >> CFQ can be trivially
> >> modified to do proportional division (i.e give time slices in
> >> proportion to weight instead of priority).
> >> And such a solution would
> >> avoid idleness problem like the one you mentioned above.
> >
> > Can you just elaborate a little on how do you get around idleness problem?
> > If you don't create idleness than if two tasks in two cgroups are doing
> > sequential IO, they might simply get into lockstep and we will not achieve
> > any differentiated service proportionate to their weight.
>
> I was thinking of a more cfq-like solution for proportional division
> at the elevator level (i.e. not a token based solution). There are two
> options for proportional bandwidth division at elevator level: 1)
> change the size of the time slice in proportion to the weights or 2)
> allocate equal time slice each time but allocate more slices to cgroup
> with more weight. For (2), we can actually keep track of time taken to
> serve requests and allocate time slices in such a way that the actual
> disk time is proportional to the weight. We can adopt a fair-queuing
> (http://lkml.org/lkml/2008/4/1/234) like approach for this if we want
> to go that way.
>
> I am not sure if the solutions mentioned above will have the lockstep
> problem you mentioned above or not. Since we are allocating time
> slices, and would have anticipation built in (just like cfq), we would
> have some level of idleness. But this idleness can be predicted based
> on a thread behavior.
If I understand that correctly, the problem may arise whenever you
have to deal with *synchronous* I/O, where you may not see the streams
of requests generated by tasks as continuously backlogged (and the
algorithm used to distribute bandwidth makes the implicit assumption
that they are, as in the cfq case).
A cfq-like solution with idling enabled AFAIK should not suffer from
this problem, as it creates a backlog for the process being anticipated.
But anticipation is not always used, and cfq currently disables it for
SSDs and in other cases where it may hurt performance (e.g., NCQ drives
in the presence of seeky loads). So, in these cases, something still
needs to be done if we want a proportional bandwidth distribution and
we don't want to pay the extra cost of idling when it's not strictly
necessary.
On Thu, Nov 13, 2008 at 06:05:58PM +0900, Ryo Tsuruta wrote:
> Hi,
>
> From: [email protected]
> Subject: [patch 0/4] [RFC] Another proportional weight IO controller
> Date: Thu, 06 Nov 2008 10:30:22 -0500
>
> > Hi,
> >
> > If you are not already tired of so many io controller implementations, here
> > is another one.
> >
> > This is a very eary very crude implementation to get early feedback to see
> > if this approach makes any sense or not.
> >
> > This controller is a proportional weight IO controller primarily
> > based on/inspired by dm-ioband. One of the things I personally found little
> > odd about dm-ioband was need of a dm-ioband device for every device we want
> > to control. I thought that probably we can make this control per request
> > queue and get rid of device mapper driver. This should make configuration
> > aspect easy.
> >
> > I have picked up quite some amount of code from dm-ioband especially for
> > biocgroup implementation.
> >
> > I have done very basic testing and that is running 2-3 dd commands in different
> > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
> >
> > More details about the design and how to are in documentation patch.
> >
> > Your comments are welcome.
>
> Do you have any benchmark results?
> I'm especially interested in the followings:
> - Comparison of disk performance with and without the I/O controller patch.
When I dynamically disabled the bio control, I did not observe any
impact on performance, because in that case it practically boils down
to just an additional variable check in __make_request().
> - Put uneven I/O loads. Processes, which belong to a cgroup which is
> given a smaller weight than another cgroup, put heavier I/O load
> like the following.
>
> echo 1024 > /cgroup/bio/test1/bio.shares
> echo 8192 > /cgroup/bio/test2/bio.shares
>
> echo $$ > /cgroup/bio/test1/tasks
> dd if=/somefile1-1 of=/dev/null &
> dd if=/somefile1-2 of=/dev/null &
> ...
> dd if=/somefile1-100 of=/dev/null
> echo $$ > /cgroup/bio/test2/tasks
> dd if=/somefile2-1 of=/dev/null &
> dd if=/somefile2-2 of=/dev/null &
> ...
> dd if=/somefile2-10 of=/dev/null &
I have not tried this case.
Ryo, do you still want to stick to two level scheduling? Given the
problem of it breaking the underlying scheduler's assumptions, it
probably makes more sense to do the IO control in each individual IO
scheduler.
I have had a very brief look at BFQ's hierarchical proportional
weight/priority IO control and it looks good. Maybe we can adopt it for
other IO schedulers also.
Thanks
Vivek
On Wed, Nov 12, 2008 at 01:20:13PM -0800, Nauman Rafique wrote:
[..]
> >> What do you think about elevator based solutions like 2 level cfq
> >> patches submitted by Satoshi and Vasily earlier?
> >
> > I have had a very high level look at Satoshi's patch. I will go into
> > details soon. I was thinking that this patch solves the problem only
> > for CFQ. Can we create a common layer which can be shared by all
> > the four IO schedulers.
> >
> > So this one common layer can take care of all the management w.r.t
> > per device per cgroup data structures and track all the groups, their
> > limits (either token based or time based scheme), and control the
> > dispatch of requests.
> >
> > This way we can enable IO controller not only for CFQ but for all the
> > IO schedulers without duplicating too much of code.
> >
> > This is what I am playing around with currently. At this point I am
> > not sure, how much of common ground I can have between all the IO
> > schedulers.
>
> I see your point. But having some common code in different schedulers
> is not worse than what we have today (cfq, as, and deadline all have
> some common code). Besides, each lower level (elevator level)
> scheduler might impose certain requirements on higher level schedulers
> (e.g RT requests for cfq that we talked about earlier).
>
> >
> >> CFQ can be trivially
> >> modified to do proportional division (i.e give time slices in
> >> proportion to weight instead of priority).
> >> And such a solution would
> >> avoid idleness problem like the one you mentioned above.
> >
> > Can you just elaborate a little on how do you get around idleness problem?
> > If you don't create idleness than if two tasks in two cgroups are doing
> > sequential IO, they might simply get into lockstep and we will not achieve
> > any differentiated service proportionate to their weight.
>
> I was thinking of a more cfq-like solution for proportional division
> at the elevator level (i.e. not a token based solution). There are two
> options for proportional bandwidth division at elevator level: 1)
> change the size of the time slice in proportion to the weights or 2)
> allocate equal time slice each time but allocate more slices to cgroup
> with more weight. For (2), we can actually keep track of time taken to
> serve requests and allocate time slices in such a way that the actual
> disk time is proportional to the weight. We can adopt a fair-queuing
> (http://lkml.org/lkml/2008/4/1/234) like approach for this if we want
> to go that way.
Hi Nauman,
I think doing proportional weight division at the elevator level will be
a little difficult, because if we go for a full hierarchical solution
then we will be doing proportional weight division among tasks as well
as groups.
For example, consider this. Assume at the root level there are three
tasks A, B and C and two cgroups D and E. Now for proportional weight
division we should consider A, B, C, D and E at the same level and then
try to divide the BW (thanks to peterz for clarifying this).
Another approach could be to consider A, B and C to be in the root
cgroup, then treat root, D and E as the competing groups and try to
divide the BW. But this is not how the cpu controller operates; I think
this approach was initially implemented for group scheduling in the cpu
controller and later changed.
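Just to put rough numbers on the difference (equal weights of 100
everywhere, purely for illustration): in the first model A, B, C, D and E
are five peers, so each gets 1/5 of the disk time and each of the cgroups
D and E ends up with 20% no matter how many tasks it contains. In the
second model root, D and E are three peers with 1/3 each, and A, B and C
then share the root's third, ending up with roughly 11% each.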
How the proportional weight division is done among tasks is a property
of the IO scheduler. cfq decides to use time slices according to priority
and bfq decides to use tokens. So probably we can't move this to the common
elevator layer.
I think Satoshi's cfq controller patches also do not seem to consider
A, B, C, D and E to be at the same level; instead they treat cgroup "/", D and E
as being at the same level and try to do proportional BW division among these.
Satoshi, please correct me if that's not the case.
The above example raises another question: what to do with IO schedulers
which do not differentiate between tasks. For example, noop. It simply has
got one single linked list, has no notion of an io context, and does not
differentiate between IO coming from different tasks. In that case we
probably have no choice but to group A, B and C's bios in the root cgroup
and do proportional weight division among the "root", D and E groups. I have
not looked at deadline and AS yet.
So at this point of time I think that porting BFQ's hierarchical
scheduling implementation to the other IO schedulers might make sense. Thoughts?
While doing this, maybe we can try to keep some functionality, like the cgroup
interface, common among the various IO schedulers.
Thanks
Vivek
On Thu, Nov 13, 2008 at 7:58 AM, Vivek Goyal <[email protected]> wrote:
>
> On Thu, Nov 13, 2008 at 06:05:58PM +0900, Ryo Tsuruta wrote:
> > Hi,
> >
> > From: [email protected]
> > Subject: [patch 0/4] [RFC] Another proportional weight IO controller
> > Date: Thu, 06 Nov 2008 10:30:22 -0500
> >
> > > Hi,
> > >
> > > If you are not already tired of so many io controller implementations, here
> > > is another one.
> > >
> > > This is a very eary very crude implementation to get early feedback to see
> > > if this approach makes any sense or not.
> > >
> > > This controller is a proportional weight IO controller primarily
> > > based on/inspired by dm-ioband. One of the things I personally found little
> > > odd about dm-ioband was need of a dm-ioband device for every device we want
> > > to control. I thought that probably we can make this control per request
> > > queue and get rid of device mapper driver. This should make configuration
> > > aspect easy.
> > >
> > > I have picked up quite some amount of code from dm-ioband especially for
> > > biocgroup implementation.
> > >
> > > I have done very basic testing and that is running 2-3 dd commands in different
> > > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
> > >
> > > More details about the design and how to are in documentation patch.
> > >
> > > Your comments are welcome.
> >
> > Do you have any benchmark results?
> > I'm especially interested in the followings:
> > - Comparison of disk performance with and without the I/O controller patch.
>
> If I dynamically disable the bio control, then I did not observe any
> impact on performance. Because in that case practically it boils down
> to just an additional variable check in __make_request().
>
> > - Put uneven I/O loads. Processes, which belong to a cgroup which is
> > given a smaller weight than another cgroup, put heavier I/O load
> > like the following.
> >
> > echo 1024 > /cgroup/bio/test1/bio.shares
> > echo 8192 > /cgroup/bio/test2/bio.shares
> >
> > echo $$ > /cgroup/bio/test1/tasks
> > dd if=/somefile1-1 of=/dev/null &
> > dd if=/somefile1-2 of=/dev/null &
> > ...
> > dd if=/somefile1-100 of=/dev/null
> > echo $$ > /cgroup/bio/test2/tasks
> > dd if=/somefile2-1 of=/dev/null &
> > dd if=/somefile2-2 of=/dev/null &
> > ...
> > dd if=/somefile2-10 of=/dev/null &
>
> I have not tried this case.
>
> Ryo, do you still want to stick to two level scheduling? Given the problem
> of it breaking down underlying scheduler's assumptions, probably it makes
> more sense to the IO control at each individual IO scheduler.
Vivek,
I agree with you that a 2-layer scheduler *might* invalidate some
IO scheduler assumptions (though some testing might help here to
confirm that). However, one big concern I have with proportional
division at the IO scheduler level is that there is no means of doing
admission control at the request queue for the device. What we need is
request queue partitioning per cgroup.
Consider that I want to divide my disk's bandwidth among 3
cgroups (A, B and C) equally. But say some tasks in cgroup A flood
the disk with IO requests and completely use up all of the request
descriptors in the rq, so that subsequent IOs get blocked waiting for a
slot to free up in the rq, which hurts their overall latency. One might
argue that over the long term we'll still get equal bandwidth division
between these cgroups. But now consider that cgroup A has tasks that
always storm the disk with a large number of IOs, which can be a problem
for the other cgroups.
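(For instance, with the default nr_requests of 128 on a queue, a single
cgroup keeping 128 requests in flight leaves no descriptors for anyone
else until completions free up slots.)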
This actually becomes an even larger problem when we want to
support high priority requests, as they may get blocked behind other,
lower priority requests which have used up all the available request
descriptors in the rq. With request queue division we can achieve this easily
by having the tasks requiring high priority IO belong to a different cgroup.
dm-ioband and any other 2-level scheduler can do this easily.
-Divyesh
>
> I have had a very brief look at BFQ's hierarchical proportional
> weight/priority IO control and it looks good. May be we can adopt it for
> other IO schedulers also.
>
> Thanks
> Vivek
> From: Vivek Goyal <[email protected]>
> Date: Thu, Nov 13, 2008 01:08:21PM -0500
>
> On Wed, Nov 12, 2008 at 01:20:13PM -0800, Nauman Rafique wrote:
...
> > I was thinking of a more cfq-like solution for proportional division
> > at the elevator level (i.e. not a token based solution). There are two
> > options for proportional bandwidth division at elevator level: 1)
> > change the size of the time slice in proportion to the weights or 2)
> > allocate equal time slice each time but allocate more slices to cgroup
> > with more weight. For (2), we can actually keep track of time taken to
> > serve requests and allocate time slices in such a way that the actual
> > disk time is proportional to the weight. We can adopt a fair-queuing
> > (http://lkml.org/lkml/2008/4/1/234) like approach for this if we want
> > to go that way.
>
> Hi Nauman,
>
> I think doing proportional weight division at elevator level will be
> little difficult, because if we go for a full hierarchical solution then
> we will be doing proportional weight division among tasks as well as
> groups.
>
> For example, consider this. Assume at root level there are three tasks
> A, B, C and two cgroups D and E. Now for proportional weight division we
> should consider A, B, C, D and E at same level and then try to divide
> the BW (Thanks to peterz for clarifying this).
>
> Other approach could be that consider A, B, C in root cgroup and then
> consider root, D and E competing groups and try to divide the BW. But
> this is not how cpu controller operates and this approach I think was
> initially implemented for group scheduling in cpu controller and later
> changed.
>
> How the proportional weight division is done among tasks is a property
> of IO scheduler. cfq decides to use time slices according to priority
> and bfq decides to use tokens. So probably we can't move this to common
> elevator layer.
>
cfq and bfq are pretty similar in the concepts they adopt, and the pure
time-based approach of cfq can be extended to arbitrary hierarchies.
Even in bfq, when dealing with groups that generate only seeky traffic
we don't try to be fair in the service domain, as that would decrease the
aggregate throughput too much; instead we fall back to a time-based approach.
[ This is a design choice, but it does not depend on the algorithms,
and of course can be changed... ]
The two approaches can be mixed/unified, for example, by using wf2q+ to
schedule cfq's slices in the time domain; the main remaining
difference would be the ability of bfq to provide service-domain
guarantees.
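Just to make the mixed/unified idea a bit more concrete, here is a tiny
user-space sketch (this is not kernel code and not full wf2q+ -- there is no
eligibility/start-time handling and no hierarchy, and all the names are made
up) of handing out fixed-length slices by virtual time, so that the disk time
each group receives ends up proportional to its weight:

#include <stdio.h>

struct group {
	const char *name;
	unsigned int weight;          /* relative share of disk time */
	unsigned long long vtime;     /* virtual finish time of last slice */
	unsigned long long served_ms; /* disk time actually received */
};

/* serve the group with the smallest virtual finish time */
static struct group *pick_group(struct group *g, int n)
{
	struct group *min = NULL;
	int i;

	for (i = 0; i < n; i++)
		if (!min || g[i].vtime < min->vtime)
			min = &g[i];
	return min;
}

int main(void)
{
	struct group groups[] = {
		{ "A", 100, 0, 0 },
		{ "B", 200, 0, 0 },
		{ "C", 400, 0, 0 },
	};
	const unsigned long long slice_ms = 100;
	int i;

	for (i = 0; i < 70; i++) {
		struct group *g = pick_group(groups, 3);

		g->served_ms += slice_ms;
		/* charge the slice in the virtual time domain: heavier
		 * weight => smaller charge => scheduled more often */
		g->vtime += slice_ms * 1000 / g->weight;
	}

	for (i = 0; i < 3; i++)
		printf("%s (weight %u): %llu ms\n",
		       groups[i].name, groups[i].weight, groups[i].served_ms);
	return 0;
}

With weights 100/200/400 this ends up giving roughly 1000/2000/4000 ms out of
the 7000 ms total, i.e. a 1:2:4 split.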
> I think Satoshi's cfq controller patches also do not seem to be considering
> A, B, C, D and E to be at same level, instead it treats cgroup "/" , D and E
> at same level and tries to do proportional BW division among these.
> Satoshi, please correct me, if that's not the case.
>
> Above example, raises another question and that is what to do wih IO
> schedulers which do not differentiate between tasks. For example, noop. It
> simply has got one single linked list and does not have the notion of
> io context and does not differentiate between IO coming from different
> tasks. In that case probably we have no choice but to group A, B, C's bio
> in root cgroup and do proportional weight division among "root", D and E
> groups. I have not looked at deadline and AS yet.
>
When you talk about grouping tasks into the root cgroup and then
scheduling inside the groups using an existing scheduler, do you mean
doing something like creating a ``little as'' or ``little noop'' queue
per group, somewhat like what happens with classless leaf qdiscs in
network scheduling, first selecting the leaf group to be scheduled and
then using the per-leaf scheduler to select the request from that leaf?
A good thing about this approach would be that idling would still make
sense and the upper infrastructure would be the same for all the schedulers
(except for cfq and bfq, which in my opinion better fit the cpu scheduler's
hierarchical approach, with hierarchies confined to scheduling classes).
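Concretely, something like this toy user-space sketch is what I have in mind
(all names are invented, nothing kernel-level): each group keeps its own
little FIFO, the outer level picks a leaf group (plain round robin below, but
it could be any fair policy), and the inner level just dispatches the head of
that group's FIFO, much like a classless leaf qdisc:

#include <stdio.h>

#define MAX_RQ 8

struct group_queue {
	const char *name;
	int rq[MAX_RQ];   /* pending request ids, FIFO order */
	int head, tail;
};

static int gq_empty(const struct group_queue *g)
{
	return g->head == g->tail;
}

static void gq_add(struct group_queue *g, int id)
{
	if (g->tail < MAX_RQ)
		g->rq[g->tail++] = id;
}

static int gq_dispatch(struct group_queue *g)
{
	return g->rq[g->head++];
}

int main(void)
{
	struct group_queue groups[2] = {
		{ "D", { 0 }, 0, 0 },
		{ "E", { 0 }, 0, 0 },
	};
	int i, turn = 0;

	for (i = 0; i < 4; i++) {
		gq_add(&groups[0], 100 + i);  /* group D's requests */
		gq_add(&groups[1], 200 + i);  /* group E's requests */
	}

	/* outer level: pick a leaf group; inner level: FIFO inside the group */
	while (!gq_empty(&groups[0]) || !gq_empty(&groups[1])) {
		struct group_queue *g = &groups[turn];

		turn = (turn + 1) % 2;
		if (gq_empty(g))
			continue;
		printf("dispatch rq %d from group %s\n", gq_dispatch(g), g->name);
	}
	return 0;
}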
> So at this point of time I think that probably porting BFQ's hierarchical
> scheduling implementation to other IO schedulers might make sense. Thoughts?
>
IMO for cfq, given the similarities, this can be done without conceptual
problems. How to do that for schedulers like as, noop or deadline, and
whether this is the best solution, is an interesting problem :)
On Thu, Nov 13, 2008 at 10:41:57AM -0800, Divyesh Shah wrote:
> On Thu, Nov 13, 2008 at 7:58 AM, Vivek Goyal <[email protected]> wrote:
> >
> > On Thu, Nov 13, 2008 at 06:05:58PM +0900, Ryo Tsuruta wrote:
> > > Hi,
> > >
> > > From: [email protected]
> > > Subject: [patch 0/4] [RFC] Another proportional weight IO controller
> > > Date: Thu, 06 Nov 2008 10:30:22 -0500
> > >
> > > > Hi,
> > > >
> > > > If you are not already tired of so many io controller implementations, here
> > > > is another one.
> > > >
> > > > This is a very eary very crude implementation to get early feedback to see
> > > > if this approach makes any sense or not.
> > > >
> > > > This controller is a proportional weight IO controller primarily
> > > > based on/inspired by dm-ioband. One of the things I personally found little
> > > > odd about dm-ioband was need of a dm-ioband device for every device we want
> > > > to control. I thought that probably we can make this control per request
> > > > queue and get rid of device mapper driver. This should make configuration
> > > > aspect easy.
> > > >
> > > > I have picked up quite some amount of code from dm-ioband especially for
> > > > biocgroup implementation.
> > > >
> > > > I have done very basic testing and that is running 2-3 dd commands in different
> > > > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
> > > >
> > > > More details about the design and how to are in documentation patch.
> > > >
> > > > Your comments are welcome.
> > >
> > > Do you have any benchmark results?
> > > I'm especially interested in the followings:
> > > - Comparison of disk performance with and without the I/O controller patch.
> >
> > If I dynamically disable the bio control, then I did not observe any
> > impact on performance. Because in that case practically it boils down
> > to just an additional variable check in __make_request().
> >
> > > - Put uneven I/O loads. Processes, which belong to a cgroup which is
> > > given a smaller weight than another cgroup, put heavier I/O load
> > > like the following.
> > >
> > > echo 1024 > /cgroup/bio/test1/bio.shares
> > > echo 8192 > /cgroup/bio/test2/bio.shares
> > >
> > > echo $$ > /cgroup/bio/test1/tasks
> > > dd if=/somefile1-1 of=/dev/null &
> > > dd if=/somefile1-2 of=/dev/null &
> > > ...
> > > dd if=/somefile1-100 of=/dev/null
> > > echo $$ > /cgroup/bio/test2/tasks
> > > dd if=/somefile2-1 of=/dev/null &
> > > dd if=/somefile2-2 of=/dev/null &
> > > ...
> > > dd if=/somefile2-10 of=/dev/null &
> >
> > I have not tried this case.
> >
> > Ryo, do you still want to stick to two level scheduling? Given the problem
> > of it breaking down underlying scheduler's assumptions, probably it makes
> > more sense to the IO control at each individual IO scheduler.
>
> Vivek,
> I agree with you that 2 layer scheduler *might* invalidate some
> IO scheduler assumptions (though some testing might help here to
> confirm that). However, one big concern I have with proportional
> division at the IO scheduler level is that there is no means of doing
> admission control at the request queue for the device. What we need is
> request queue partitioning per cgroup.
> Consider that I want to divide my disk's bandwidth among 3
> cgroups(A, B and C) equally. But say some tasks in the cgroup A flood
> the disk with IO requests and completely use up all of the requests in
> the rq resulting in the following IOs to be blocked on a slot getting
> empty in the rq thus affecting their overall latency. One might argue
> that over the long term though we'll get equal bandwidth division
> between these cgroups. But now consider that cgroup A has tasks that
> always storm the disk with large number of IOs which can be a problem
> for other cgroups.
> This actually becomes an even larger problem when we want to
> support high priority requests as they may get blocked behind other
> lower priority requests which have used up all the available requests
> in the rq. With request queue division we can achieve this easily by
> having tasks requiring high priority IO belong to a different cgroup.
> dm-ioband and any other 2-level scheduler can do this easily.
>
Hi Divyesh,
I understand that request descriptors can be a bottleneck here. But that
should be an issue even today with CFQ, where a low priority process can
consume lots of request descriptors and prevent a higher priority process
from submitting requests. I think you already said it and I just
reiterated it.
I think in that case we need to do something about request descriptor
allocation instead of relying on a 2nd level of IO scheduler.
At this point I am not sure what to do. Maybe we can take feedback from the
respective queue (like the cfqq) of the submitting application, and if it is
already backlogged beyond a certain limit, put that application to sleep
and stop it from consuming an excessive number of request descriptors
(despite the fact that we have free request descriptors).
Thanks
Vivek
> -Divyesh
>
> >
> > I have had a very brief look at BFQ's hierarchical proportional
> > weight/priority IO control and it looks good. May be we can adopt it for
> > other IO schedulers also.
> >
> > Thanks
> > Vivek
On Thu, Nov 13, 2008 at 10:58:34AM -0500, Vivek Goyal wrote:
> On Thu, Nov 13, 2008 at 06:05:58PM +0900, Ryo Tsuruta wrote:
> > Hi,
> >
> > From: [email protected]
> > Subject: [patch 0/4] [RFC] Another proportional weight IO controller
> > Date: Thu, 06 Nov 2008 10:30:22 -0500
> >
> > > Hi,
> > >
> > > If you are not already tired of so many io controller implementations, here
> > > is another one.
> > >
> > > This is a very eary very crude implementation to get early feedback to see
> > > if this approach makes any sense or not.
> > >
> > > This controller is a proportional weight IO controller primarily
> > > based on/inspired by dm-ioband. One of the things I personally found little
> > > odd about dm-ioband was need of a dm-ioband device for every device we want
> > > to control. I thought that probably we can make this control per request
> > > queue and get rid of device mapper driver. This should make configuration
> > > aspect easy.
> > >
> > > I have picked up quite some amount of code from dm-ioband especially for
> > > biocgroup implementation.
> > >
> > > I have done very basic testing and that is running 2-3 dd commands in different
> > > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
> > >
> > > More details about the design and how to are in documentation patch.
> > >
> > > Your comments are welcome.
> >
> > Do you have any benchmark results?
> > I'm especially interested in the followings:
> > - Comparison of disk performance with and without the I/O controller patch.
>
> If I dynamically disable the bio control, then I did not observe any
> impact on performance. Because in that case practically it boils down
> to just an additional variable check in __make_request().
>
Oh, I understood your question wrongly. You are asking about the
performance penalty of enabling the IO controller on a device.
I have not done any extensive benchmarking. If I run two dd commands
without the controller, I get 80 MB/s from the disk (roughly 40 MB/s for each
task). With the bio group enabled (default token=2000), I was getting a total
BW of roughly 68 MB/s.
I have not done any performance analysis or optimizations at this point of
time. I plan to do that once we have some sort of common understanding about
a particular approach. There are so many IO controllers floating around; right
now I am more concerned with whether we can all come to a common platform.
Thanks
Vivek
> > - Put uneven I/O loads. Processes, which belong to a cgroup which is
> > given a smaller weight than another cgroup, put heavier I/O load
> > like the following.
> >
> > echo 1024 > /cgroup/bio/test1/bio.shares
> > echo 8192 > /cgroup/bio/test2/bio.shares
> >
> > echo $$ > /cgroup/bio/test1/tasks
> > dd if=/somefile1-1 of=/dev/null &
> > dd if=/somefile1-2 of=/dev/null &
> > ...
> > dd if=/somefile1-100 of=/dev/null
> > echo $$ > /cgroup/bio/test2/tasks
> > dd if=/somefile2-1 of=/dev/null &
> > dd if=/somefile2-2 of=/dev/null &
> > ...
> > dd if=/somefile2-10 of=/dev/null &
>
> I have not tried this case.
>
> Ryo, do you still want to stick to two level scheduling? Given the problem
> of it breaking down underlying scheduler's assumptions, probably it makes
> more sense to the IO control at each individual IO scheduler.
>
> I have had a very brief look at BFQ's hierarchical proportional
> weight/priority IO control and it looks good. May be we can adopt it for
> other IO schedulers also.
>
> Thanks
> Vivek
On Thu, Nov 13, 2008 at 11:15 AM, Fabio Checconi <[email protected]> wrote:
>> From: Vivek Goyal <[email protected]>
>> Date: Thu, Nov 13, 2008 01:08:21PM -0500
>>
>> On Wed, Nov 12, 2008 at 01:20:13PM -0800, Nauman Rafique wrote:
> ...
>> > I was thinking of a more cfq-like solution for proportional division
>> > at the elevator level (i.e. not a token based solution). There are two
>> > options for proportional bandwidth division at elevator level: 1)
>> > change the size of the time slice in proportion to the weights or 2)
>> > allocate equal time slice each time but allocate more slices to cgroup
>> > with more weight. For (2), we can actually keep track of time taken to
>> > serve requests and allocate time slices in such a way that the actual
>> > disk time is proportional to the weight. We can adopt a fair-queuing
>> > (http://lkml.org/lkml/2008/4/1/234) like approach for this if we want
>> > to go that way.
>>
>> Hi Nauman,
>>
>> I think doing proportional weight division at elevator level will be
>> little difficult, because if we go for a full hierarchical solution then
>> we will be doing proportional weight division among tasks as well as
>> groups.
>>
>> For example, consider this. Assume at root level there are three tasks
>> A, B, C and two cgroups D and E. Now for proportional weight division we
>> should consider A, B, C, D and E at same level and then try to divide
>> the BW (Thanks to peterz for clarifying this).
>>
>> Other approach could be that consider A, B, C in root cgroup and then
>> consider root, D and E competing groups and try to divide the BW. But
>> this is not how cpu controller operates and this approach I think was
>> initially implemented for group scheduling in cpu controller and later
>> changed.
>>
>> How the proportional weight division is done among tasks is a property
>> of IO scheduler. cfq decides to use time slices according to priority
>> and bfq decides to use tokens. So probably we can't move this to common
>> elevator layer.
>>
>
> cfq and bfq are pretty similar in the concepts they adopt, and the pure
> time-based approach of cfq can be extended to arbitrary hierarchies.
>
> Even in bfq, when dealing with groups that generate only seeky traffic
> we don't try to be fair in the service domain, as it would decrease too
> much the aggregate throughput, but we fall back to a time-based approach.
>
> [ This is a design choice, but it does not depend on the algorithms,
> and of course can be changed... ]
>
> The two approaches can be mixed/unified, for example, using wf2q+ to
> schedule the slices, in the time domain, of cfq; the main remaining
> difference would be the ability of bfq to provide service-domain
> guarantees.
Before going into the design of an elevator level scheduler, we should
have some consensus on abandoning the two level approach. In fact, it
would be useful if we had Ryo and Satoshi jump into this discussion
and express their opinions.
Having said that, I think I agree with Fabio on some mix/unification
of the BFQ and CFQ patches; especially, using wf2q+ to schedule time slices
would be a small patch on top of the existing 2-level cfq patches. That
approach might also reduce the resistance to having this in the tree, as CFQ
is already the default scheduler.
>
>
>> I think Satoshi's cfq controller patches also do not seem to be considering
>> A, B, C, D and E to be at same level, instead it treats cgroup "/" , D and E
>> at same level and tries to do proportional BW division among these.
>> Satoshi, please correct me, if that's not the case.
>>
>> Above example, raises another question and that is what to do wih IO
>> schedulers which do not differentiate between tasks. For example, noop. It
>> simply has got one single linked list and does not have the notion of
>> io context and does not differentiate between IO coming from different
>> tasks. In that case probably we have no choice but to group A, B, C's bio
>> in root cgroup and do proportional weight division among "root", D and E
>> groups. I have not looked at deadline and AS yet.
>>
>
> When you talk about grouping tasks into the root cgroup and then
> scheduling inside the groups using an existing scheduler, do you mean
> doing something like creating a ``little as'' or ``little noop'' queue
> per each group, somehow like what happens with classless leaf qdiscs in
> network scheduling, and then select first the leaf group to be scheduled,
> and then using the per-leaf scheduler to select the request from the leaf?
>
> A good thing about this approach would be that idling would still make
> sense and the upper infrastructure would be the same for all the schedulers
> (except for cfq and bfq, that in my opinion better fit the cpu scheduler's
> hierarchical approach, with hierarchies confined into scheduling classes).
>
>
>> So at this point of time I think that probably porting BFQ's hierarchical
>> scheduling implementation to other IO schedulers might make sense. Thoughts?
>>
>
> IMO for cfq, given the similarities, this can be done without conceptual
> problems. How to do that for schedulers like as, noop or deadline, and
> if this is the best solution, is an interesting problem :)
It might be a little too early to start patching things into other
schedulers. First, because we still don't have common ground on the
exact approach for proportional bandwidth division. Second, if
somebody is using vanilla noop, deadline or as, do they really care
about proportional division? If they did, they would probably be using
cfq already. So we can get something going for cfq first, and then we
can move to other schedulers.
On Thu, Nov 13, 2008 at 1:46 PM, Vivek Goyal <[email protected]> wrote:
>
> On Thu, Nov 13, 2008 at 10:41:57AM -0800, Divyesh Shah wrote:
> > On Thu, Nov 13, 2008 at 7:58 AM, Vivek Goyal <[email protected]> wrote:
> > >
> > > On Thu, Nov 13, 2008 at 06:05:58PM +0900, Ryo Tsuruta wrote:
> > > > Hi,
> > > >
> > > > From: [email protected]
> > > > Subject: [patch 0/4] [RFC] Another proportional weight IO controller
> > > > Date: Thu, 06 Nov 2008 10:30:22 -0500
> > > >
> > > > > Hi,
> > > > >
> > > > > If you are not already tired of so many io controller implementations, here
> > > > > is another one.
> > > > >
> > > > > This is a very eary very crude implementation to get early feedback to see
> > > > > if this approach makes any sense or not.
> > > > >
> > > > > This controller is a proportional weight IO controller primarily
> > > > > based on/inspired by dm-ioband. One of the things I personally found little
> > > > > odd about dm-ioband was need of a dm-ioband device for every device we want
> > > > > to control. I thought that probably we can make this control per request
> > > > > queue and get rid of device mapper driver. This should make configuration
> > > > > aspect easy.
> > > > >
> > > > > I have picked up quite some amount of code from dm-ioband especially for
> > > > > biocgroup implementation.
> > > > >
> > > > > I have done very basic testing and that is running 2-3 dd commands in different
> > > > > cgroups on x86_64. Wanted to throw out the code early to get some feedback.
> > > > >
> > > > > More details about the design and how to are in documentation patch.
> > > > >
> > > > > Your comments are welcome.
> > > >
> > > > Do you have any benchmark results?
> > > > I'm especially interested in the followings:
> > > > - Comparison of disk performance with and without the I/O controller patch.
> > >
> > > If I dynamically disable the bio control, then I did not observe any
> > > impact on performance. Because in that case practically it boils down
> > > to just an additional variable check in __make_request().
> > >
> > > > - Put uneven I/O loads. Processes, which belong to a cgroup which is
> > > > given a smaller weight than another cgroup, put heavier I/O load
> > > > like the following.
> > > >
> > > > echo 1024 > /cgroup/bio/test1/bio.shares
> > > > echo 8192 > /cgroup/bio/test2/bio.shares
> > > >
> > > > echo $$ > /cgroup/bio/test1/tasks
> > > > dd if=/somefile1-1 of=/dev/null &
> > > > dd if=/somefile1-2 of=/dev/null &
> > > > ...
> > > > dd if=/somefile1-100 of=/dev/null
> > > > echo $$ > /cgroup/bio/test2/tasks
> > > > dd if=/somefile2-1 of=/dev/null &
> > > > dd if=/somefile2-2 of=/dev/null &
> > > > ...
> > > > dd if=/somefile2-10 of=/dev/null &
> > >
> > > I have not tried this case.
> > >
> > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > more sense to the IO control at each individual IO scheduler.
> >
> > Vivek,
> > I agree with you that 2 layer scheduler *might* invalidate some
> > IO scheduler assumptions (though some testing might help here to
> > confirm that). However, one big concern I have with proportional
> > division at the IO scheduler level is that there is no means of doing
> > admission control at the request queue for the device. What we need is
> > request queue partitioning per cgroup.
> > Consider that I want to divide my disk's bandwidth among 3
> > cgroups(A, B and C) equally. But say some tasks in the cgroup A flood
> > the disk with IO requests and completely use up all of the requests in
> > the rq resulting in the following IOs to be blocked on a slot getting
> > empty in the rq thus affecting their overall latency. One might argue
> > that over the long term though we'll get equal bandwidth division
> > between these cgroups. But now consider that cgroup A has tasks that
> > always storm the disk with large number of IOs which can be a problem
> > for other cgroups.
> > This actually becomes an even larger problem when we want to
> > support high priority requests as they may get blocked behind other
> > lower priority requests which have used up all the available requests
> > in the rq. With request queue division we can achieve this easily by
> > having tasks requiring high priority IO belong to a different cgroup.
> > dm-ioband and any other 2-level scheduler can do this easily.
> >
>
> Hi Divyesh,
>
> I understand that request descriptors can be a bottleneck here. But that
> should be an issue even today with CFQ where a low priority process
> consume lots of request descriptors and prevent higher priority process
> from submitting the request.
Yes that is true, and that is one of the main reasons why I would lean
towards a 2-level scheduler: you get request queue division as well.
> I think you already said it and I just
> reiterated it.
>
> I think in that case we need to do something about request descriptor
> allocation instead of relying on 2nd level of IO scheduler.
> At this point I am not sure what to do. May be we can take feedback from the
> respective queue (like cfqq) of submitting application and if it is already
> backlogged beyond a certain limit, then we can put that application to sleep
> and stop it from consuming excessive amount of request descriptors
> (despite the fact that we have free request descriptors).
This should be done per-cgroup rather than per-process.
IMHO, abandoning the 2-level approach without having a solid plan for
tackling this issue might not be the best idea, because that would
invalidate the SLA that the proportional b/w controller promises.
>
> Thanks
> Vivek
>
> > -Divyesh
> >
> > >
> > > I have had a very brief look at BFQ's hierarchical proportional
> > > weight/priority IO control and it looks good. May be we can adopt it for
> > > other IO schedulers also.
> > >
> > > Thanks
> > > Vivek
> From: Nauman Rafique <[email protected]>
> Date: Thu, Nov 13, 2008 02:27:41PM -0800
>
> On Thu, Nov 13, 2008 at 11:15 AM, Fabio Checconi <[email protected]> wrote:
> >> How the proportional weight division is done among tasks is a property
> >> of IO scheduler. cfq decides to use time slices according to priority
> >> and bfq decides to use tokens. So probably we can't move this to common
> >> elevator layer.
> >>
> >
> > cfq and bfq are pretty similar in the concepts they adopt, and the pure
> > time-based approach of cfq can be extended to arbitrary hierarchies.
> >
> > Even in bfq, when dealing with groups that generate only seeky traffic
> > we don't try to be fair in the service domain, as it would decrease too
> > much the aggregate throughput, but we fall back to a time-based approach.
> >
> > [ This is a design choice, but it does not depend on the algorithms,
> > and of course can be changed... ]
> >
> > The two approaches can be mixed/unified, for example, using wf2q+ to
> > schedule the slices, in the time domain, of cfq; the main remaining
> > difference would be the ability of bfq to provide service-domain
> > guarantees.
>
> Before going into the design of elevator level scheduler, we should
> have some consensus on abandoning the two level approach. Infact, it
> would be useful if we had Ryo and Satoshi jump into this discussion
> and express their opinion.
>
You're right. I was only trying to give some design elements, thinking
that they could help in the evaluation of the two approaches, speaking
of the one I know better.
...
> >> So at this point of time I think that probably porting BFQ's hierarchical
> >> scheduling implementation to other IO schedulers might make sense. Thoughts?
> >>
> >
> > IMO for cfq, given the similarities, this can be done without conceptual
> > problems. How to do that for schedulers like as, noop or deadline, and
> > if this is the best solution, is an interesting problem :)
>
> It might be a little too early to start patching things into other
> schedulers. First, because we still don't have a common ground on the
> exact approach for proportional bandwidth division. Second, if
> somebody is using vanilla no-op, deadline or as, do they really care
> about proportional division? If they did, they would probably be using
> cfq already. So we can have something going for cfq first, and then we
> can move to other schedulers.
>
Sorry for being unclear, I didn't want to start patching anything. In my
opinion a hierarchical extension of as/deadline poses some interesting
scheduling issues, and exploring them can help in making better decisions.
Hi, Vivek
>
> I think doing proportional weight division at elevator level will be
> little difficult, because if we go for a full hierarchical solution then
> we will be doing proportional weight division among tasks as well as
> groups.
>
> For example, consider this. Assume at root level there are three tasks
> A, B, C and two cgroups D and E. Now for proportional weight division we
> should consider A, B, C, D and E at same level and then try to divide
> the BW (Thanks to peterz for clarifying this).
>
> Other approach could be that consider A, B, C in root cgroup and then
> consider root, D and E competing groups and try to divide the BW. But
> this is not how cpu controller operates and this approach I think was
> initially implemented for group scheduling in cpu controller and later
> changed.
>
> How the proportional weight division is done among tasks is a property
> of IO scheduler. cfq decides to use time slices according to priority
> and bfq decides to use tokens. So probably we can't move this to common
> elevator layer.
>
> I think Satoshi's cfq controller patches also do not seem to be considering
> A, B, C, D and E to be at same level, instead it treats cgroup "/" , D and
> E
> at same level and tries to do proportional BW division among these.
> Satoshi, please correct me, if that's not the case.
>
Yes.
I think that the controller should divide the share among "/" (root) and the two groups.
The reasons are as follows:
* If these tasks are all handled at the same level, then a traditional
  CFQ scheduler is already enough. If you just want to give all tasks in the
  same group the same priority (parameter), that is not I/O control but
  parameter control.
* I think that a group represents an environment which makes some sense on
  its own, and the user wants to control I/O per group. Since the group is an
  environment, tasks within the group will have their own individual
  priorities, just as in a traditional environment. Of course, a group may not
  need I/O control among its own tasks; in that case, the ioprio of its tasks
  can simply be set to the same priority.
Therefore, our scheduler divides bandwidth among groups first and then among tasks.
Satoshi UCHIDA
On Fri, 2008-11-14 at 13:58 +0900, Satoshi UCHIDA wrote:
> > I think Satoshi's cfq controller patches also do not seem to be considering
> > A, B, C, D and E to be at same level, instead it treats cgroup "/" , D and
> > E
> > at same level and tries to do proportional BW division among these.
> > Satoshi, please correct me, if that's not the case.
> >
>
> Yes.
> I think that a controller should be divided share among "/(root)" and two groups.
> This reason is follows:
>
> * If these tasks are handled at same level, it is enough by using a traditional
> CFQ scheduler.
> If you want to make all tasks in the same group the same priority(parameter),
> It is not I/O control but is parameter control.
>
> * I think that the group means the environment which makes some sense and
> user want to control I/O per groups.
> Next, the group is the environment. So, tasks within the group will have
> priorities for themselves respectively as traditional environment.
> Of course, group may not be need to control I/O.
> In such time, a ioprio of tasks should be set the same priority.
>
> Therefore, our scheduler controls among group and then among tasks
I would suggest abandoning this scheme as it's different from how the CPU
scheduler does it. The CPU scheduler is fully hierarchical and tasks in
"/" are on the same level as groups in "/".
That is, we do:
        root
       /  |  \
      1   2   A
             / \
            B   3
           / \
          4   5
Where digits are tasks, and letters are groups.
Having the two bandwidth controllers (CPU, I/O) doing different things wrt
grouping can only be confusing at best.
Hi, Vivek.
> > I think that a controller should be divided share among "/(root)" and
> two groups.
> > This reason is follows:
> >
> > * If these tasks are handled at same level, it is enough by using a
> traditional
> > CFQ scheduler.
> > If you want to make all tasks in the same group the same
> priority(parameter),
> > It is not I/O control but is parameter control.
> >
> > * I think that the group means the environment which makes some sense
> and
> > user want to control I/O per groups.
> > Next, the group is the environment. So, tasks within the group will
> have
> > priorities for themselves respectively as traditional environment.
> > Of course, group may not be need to control I/O.
> > In such time, a ioprio of tasks should be set the same priority.
> >
> > Therefore, our scheduler controls among group and then among tasks
>
> I would suggest abandoning this scheme as its different from how the CPU
> scheduler does it. The CPU scheduler is fully hierarchical and tasks in
> "/" are on the same level as groups in "/".
>
> That is, we do:
>
>         root
>        /  |  \
>       1   2   A
>              / \
>             B   3
>            / \
>           4   5
>
> Where digits are tasks, and letters are groups.
>
> Having the two bandwidth (CPU, I/O) doing different things wrt grouping
> can only be confusing at best.
>
I understand what you mean as follows:
The CPU power for task 4 is calculated as 100% * the ratio of A (among 1, 2 and A) *
the ratio of B (among B and 3) * the ratio of 4 (among 4 and 5).
However, the I/O power for task 4 is calculated as 100% * the ratio of B (among
root, A and B) * the ratio of 4 (among 4 and 5).
Therefore the two expressions are different, and users will be confused.
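For example, assuming equal weights everywhere (my own numbers): the CPU share
of task 4 is 1/3 * 1/2 * 1/2 = 1/12 (about 8%), but the I/O share of task 4
under our flat-group scheme is 1/3 * 1/2 = 1/6 (about 17%).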
So, in other cgroup controllers the children (tasks and groups) of a group are
treated at the same level, but in our CFQ cgroup controller all groups are flat.
I agree with this opinion.
I think that CFQ should support multiple layers, and that improvement would
probably be easy (by nesting cfq_data trees).
On Thu, Nov 13, 2008 at 02:57:29PM -0800, Divyesh Shah wrote:
[..]
> > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > > more sense to the IO control at each individual IO scheduler.
> > >
> > > Vivek,
> > > I agree with you that 2 layer scheduler *might* invalidate some
> > > IO scheduler assumptions (though some testing might help here to
> > > confirm that). However, one big concern I have with proportional
> > > division at the IO scheduler level is that there is no means of doing
> > > admission control at the request queue for the device. What we need is
> > > request queue partitioning per cgroup.
> > > Consider that I want to divide my disk's bandwidth among 3
> > > cgroups(A, B and C) equally. But say some tasks in the cgroup A flood
> > > the disk with IO requests and completely use up all of the requests in
> > > the rq resulting in the following IOs to be blocked on a slot getting
> > > empty in the rq thus affecting their overall latency. One might argue
> > > that over the long term though we'll get equal bandwidth division
> > > between these cgroups. But now consider that cgroup A has tasks that
> > > always storm the disk with large number of IOs which can be a problem
> > > for other cgroups.
> > > This actually becomes an even larger problem when we want to
> > > support high priority requests as they may get blocked behind other
> > > lower priority requests which have used up all the available requests
> > > in the rq. With request queue division we can achieve this easily by
> > > having tasks requiring high priority IO belong to a different cgroup.
> > > dm-ioband and any other 2-level scheduler can do this easily.
> > >
> >
> > Hi Divyesh,
> >
> > I understand that request descriptors can be a bottleneck here. But that
> > should be an issue even today with CFQ where a low priority process
> > consume lots of request descriptors and prevent higher priority process
> > from submitting the request.
>
> Yes that is true and that is one of the main reasons why I would lean
> towards 2-level scheduler coz you get request queue division as well.
>
> I think you already said it and I just
> > reiterated it.
> >
> > I think in that case we need to do something about request descriptor
> > allocation instead of relying on 2nd level of IO scheduler.
> > At this point I am not sure what to do. May be we can take feedback from the
> > respective queue (like cfqq) of submitting application and if it is already
> > backlogged beyond a certain limit, then we can put that application to sleep
> > and stop it from consuming excessive amount of request descriptors
> > (despite the fact that we have free request descriptors).
>
> This should be done per-cgroup rather than per-process.
>
Yep, a per-cgroup limit will make more sense. get_request() already calls
elv_may_queue() to get feedback from the IO scheduler. Maybe here the IO
scheduler can check how many request descriptors are already allocated to
this cgroup, and if the cgroup is congested, deny the fresh request
allocation.
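Something like the following user-space toy is the kind of accounting I have
in mind (this is not kernel code; the names and the limit are made up, and the
real logic would sit behind elv_may_queue()/get_request()):

#include <stdbool.h>
#include <stdio.h>

#define CGROUP_RQ_LIMIT 32   /* hypothetical per-cgroup descriptor cap */

struct io_cgroup {
	const char *name;
	int rq_allocated;        /* descriptors currently held by this cgroup */
};

/* the kind of answer an IO scheduler could give from its may_queue hook */
static bool cgroup_may_queue(const struct io_cgroup *cg)
{
	return cg->rq_allocated < CGROUP_RQ_LIMIT;
}

static bool get_request(struct io_cgroup *cg)
{
	if (!cgroup_may_queue(cg)) {
		printf("%s: cgroup congested, caller should sleep\n", cg->name);
		return false;
	}
	cg->rq_allocated++;
	return true;
}

static void put_request(struct io_cgroup *cg)
{
	cg->rq_allocated--;
}

int main(void)
{
	struct io_cgroup a = { "A", 0 }, b = { "B", 0 };
	int i;

	/* cgroup A floods the queue well past its own cap ... */
	for (i = 0; i < 40; i++)
		get_request(&a);

	/* ... but B still gets a descriptor, because only A is throttled */
	if (get_request(&b))
		printf("B: got a request descriptor\n");
	put_request(&b);
	return 0;
}

The point is just that the cap is per cgroup, not per queue, so one cgroup
flooding the device cannot starve the others of request descriptors.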
Thanks
Vivek
In an attempt to make sure that this discussion leads to
something useful, we have summarized the points raised in this
discussion and have come up with a strategy for the future.
The goal of this is to find common ground between all the approaches
proposed on this mailing list.
1 Start with Satoshi's latest patches.
2 Do the following to support proportional division:
a) Give time slices in proportion to weights (configurable
option). We can support both priorities and weights by doing
proportional division between requests with the same priority.
3 Schedule time slices using WF2Q+ instead of round robin.
Test the performance impact (both throughput and jitter in latency).
4 Do the following to support the goals of 2 level schedulers:
a) Limit the request descriptors allocated to each cgroup by adding
functionality to elv_may_queue()
b) Add support for putting an absolute limit on IO consumed by a
cgroup. Such support exists in dm-ioband and is provided by Andrea
Righi's patches too.
c) Add support (configurable option) to keep track of the total disk
time/sectors/count consumed at each device, and factor that into the
scheduling decision (more discussion needed here)
5 Support multiple layers of cgroups to align IO controller behavior
with CPU scheduling behavior (more discussion?)
6 Incorporate an IO tracking approach which re-uses memory resource
controller code but is not dependent on it (maybe the biocgroup patches from
dm-ioband can be used here directly)
7 Start an offline email thread to keep track of progress on the above
goals.
Please feel free to add to or modify items on the list
when you respond. Any comments/suggestions are more than welcome.
Thanks.
Divyesh & Nauman
On Fri, Nov 14, 2008 at 8:05 AM, Vivek Goyal <[email protected]> wrote:
> On Thu, Nov 13, 2008 at 02:57:29PM -0800, Divyesh Shah wrote:
>
> [..]
>> > > > Ryo, do you still want to stick to two level scheduling? Given the problem
>> > > > of it breaking down underlying scheduler's assumptions, probably it makes
>> > > > more sense to the IO control at each individual IO scheduler.
>> > >
>> > > Vivek,
>> > > I agree with you that 2 layer scheduler *might* invalidate some
>> > > IO scheduler assumptions (though some testing might help here to
>> > > confirm that). However, one big concern I have with proportional
>> > > division at the IO scheduler level is that there is no means of doing
>> > > admission control at the request queue for the device. What we need is
>> > > request queue partitioning per cgroup.
>> > > Consider that I want to divide my disk's bandwidth among 3
>> > > cgroups(A, B and C) equally. But say some tasks in the cgroup A flood
>> > > the disk with IO requests and completely use up all of the requests in
>> > > the rq resulting in the following IOs to be blocked on a slot getting
>> > > empty in the rq thus affecting their overall latency. One might argue
>> > > that over the long term though we'll get equal bandwidth division
>> > > between these cgroups. But now consider that cgroup A has tasks that
>> > > always storm the disk with large number of IOs which can be a problem
>> > > for other cgroups.
>> > > This actually becomes an even larger problem when we want to
>> > > support high priority requests as they may get blocked behind other
>> > > lower priority requests which have used up all the available requests
>> > > in the rq. With request queue division we can achieve this easily by
>> > > having tasks requiring high priority IO belong to a different cgroup.
>> > > dm-ioband and any other 2-level scheduler can do this easily.
>> > >
>> >
>> > Hi Divyesh,
>> >
>> > I understand that request descriptors can be a bottleneck here. But that
>> > should be an issue even today with CFQ where a low priority process
>> > consume lots of request descriptors and prevent higher priority process
>> > from submitting the request.
>>
>> Yes that is true and that is one of the main reasons why I would lean
>> towards 2-level scheduler coz you get request queue division as well.
>>
>> I think you already said it and I just
>> > reiterated it.
>> >
>> > I think in that case we need to do something about request descriptor
>> > allocation instead of relying on 2nd level of IO scheduler.
>> > At this point I am not sure what to do. May be we can take feedback from the
>> > respective queue (like cfqq) of submitting application and if it is already
>> > backlogged beyond a certain limit, then we can put that application to sleep
>> > and stop it from consuming excessive amount of request descriptors
>> > (despite the fact that we have free request descriptors).
>>
>> This should be done per-cgroup rather than per-process.
>>
>
> Yep, per cgroup limit will make more sense. get_request() already calls
> elv_may_queue() to get a feedback from IO scheduler. May be here IO
> scheduler can make a decision how many request descriptors are already
> allocated to this cgroup. And if the queue is congested, then IO scheduler
> can deny the fresh request allocation.
>
> Thanks
> Vivek
>
On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote:
> In an attempt to make sure that this discussion leads to
> something useful, we have summarized the points raised in this
> discussion and have come up with a strategy for future.
> The goal of this is to find common ground between all the approaches
> proposed on this mailing list.
>
> 1 Start with Satoshi's latest patches.
I have had a brief look at both Satoshi's patches and bfq. I kind of like
bfq's patches for keeping track of per cgroup, per queue data structures.
Maybe we can look there also.
> 2 Do the following to support propotional division:
> a) Give time slices in proportion to weights (configurable
> option). We can support both priorities and weights by doing
> propotional division between requests with same priorities.
> 3 Schedule time slices using WF2Q+ instead of round robin.
> Test the performance impact (both throughput and jitter in latency).
> 4 Do the following to support the goals of 2 level schedulers:
> a) Limit the request descriptors allocated to each cgroup by adding
> functionality to elv_may_queue()
> b) Add support for putting an absolute limit on IO consumed by a
> cgroup. Such support exists in dm-ioband and is provided by Andrea
> Righi's patches too.
Does dm-ioband support an absolute limit? I think until the last version they
did not. I have not checked the latest version though.
> c) Add support (configurable option) to keep track of total disk
> time/sectors/count
> consumed at each device, and factor that into scheduling decision
> (more discussion needed here)
> 5 Support multiple layers of cgroups to align IO controller behavior
> with CPU scheduling behavior (more discussion?)
> 6 Incorporate an IO tracking approach which re-uses memory resource
> controller code but is not dependent on it (may be biocgroup patches from
> dm-ioband can be used here directly)
> 7 Start an offline email thread to keep track of progress on the above
> goals.
>
> Please feel free to add/modify items to the list
> when you respond back. Any comments/suggestions are more than welcome.
>
Thanks
Vivek
> Thanks.
> Divyesh & Nauman
>
> On Fri, Nov 14, 2008 at 8:05 AM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Nov 13, 2008 at 02:57:29PM -0800, Divyesh Shah wrote:
> >
> > [..]
> >> > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> >> > > > of it breaking down underlying scheduler's assumptions, probably it makes
> >> > > > more sense to the IO control at each individual IO scheduler.
> >> > >
> >> > > Vivek,
> >> > > I agree with you that 2 layer scheduler *might* invalidate some
> >> > > IO scheduler assumptions (though some testing might help here to
> >> > > confirm that). However, one big concern I have with proportional
> >> > > division at the IO scheduler level is that there is no means of doing
> >> > > admission control at the request queue for the device. What we need is
> >> > > request queue partitioning per cgroup.
> >> > > Consider that I want to divide my disk's bandwidth among 3
> >> > > cgroups(A, B and C) equally. But say some tasks in the cgroup A flood
> >> > > the disk with IO requests and completely use up all of the requests in
> >> > > the rq resulting in the following IOs to be blocked on a slot getting
> >> > > empty in the rq thus affecting their overall latency. One might argue
> >> > > that over the long term though we'll get equal bandwidth division
> >> > > between these cgroups. But now consider that cgroup A has tasks that
> >> > > always storm the disk with large number of IOs which can be a problem
> >> > > for other cgroups.
> >> > > This actually becomes an even larger problem when we want to
> >> > > support high priority requests as they may get blocked behind other
> >> > > lower priority requests which have used up all the available requests
> >> > > in the rq. With request queue division we can achieve this easily by
> >> > > having tasks requiring high priority IO belong to a different cgroup.
> >> > > dm-ioband and any other 2-level scheduler can do this easily.
> >> > >
> >> >
> >> > Hi Divyesh,
> >> >
> >> > I understand that request descriptors can be a bottleneck here. But that
> >> > should be an issue even today with CFQ where a low priority process
> >> > consume lots of request descriptors and prevent higher priority process
> >> > from submitting the request.
> >>
> >> Yes that is true and that is one of the main reasons why I would lean
> >> towards 2-level scheduler coz you get request queue division as well.
> >>
> >> I think you already said it and I just
> >> > reiterated it.
> >> >
> >> > I think in that case we need to do something about request descriptor
> >> > allocation instead of relying on 2nd level of IO scheduler.
> >> > At this point I am not sure what to do. May be we can take feedback from the
> >> > respective queue (like cfqq) of submitting application and if it is already
> >> > backlogged beyond a certain limit, then we can put that application to sleep
> >> > and stop it from consuming excessive amount of request descriptors
> >> > (despite the fact that we have free request descriptors).
> >>
> >> This should be done per-cgroup rather than per-process.
> >>
> >
> > Yep, per cgroup limit will make more sense. get_request() already calls
> > elv_may_queue() to get a feedback from IO scheduler. May be here IO
> > scheduler can make a decision how many request descriptors are already
> > allocated to this cgroup. And if the queue is congested, then IO scheduler
> > can deny the fresh request allocation.
> >
> > Thanks
> > Vivek
> >
Vivek Goyal wrote:
> On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote:
>> In an attempt to make sure that this discussion leads to
>> something useful, we have summarized the points raised in this
>> discussion and have come up with a strategy for future.
>> The goal of this is to find common ground between all the approaches
>> proposed on this mailing list.
>>
>> 1 Start with Satoshi's latest patches.
>
> I have had a brief look at both Satoshi's patch and bfq. I kind of like
> bfq's patches for keeping track of per cgroup, per queue data structures.
> May be we can look there also.
>
>> 2 Do the following to support propotional division:
>> a) Give time slices in proportion to weights (configurable
>> option). We can support both priorities and weights by doing
>> propotional division between requests with same priorities.
>> 3 Schedule time slices using WF2Q+ instead of round robin.
>> Test the performance impact (both throughput and jitter in latency).
>> 4 Do the following to support the goals of 2 level schedulers:
>> a) Limit the request descriptors allocated to each cgroup by adding
>> functionality to elv_may_queue()
>> b) Add support for putting an absolute limit on IO consumed by a
>> cgroup. Such support exists in dm-ioband and is provided by Andrea
>> Righi's patches too.
>
> Does dm-iobnd support abosolute limit? I think till last version they did
> not. I have not check the latest version though.
>
No, dm-ioband still provides weight/share control only. Only Andrea Righi's
patches support absolute limit.
>> c) Add support (configurable option) to keep track of total disk
>> time/sectors/count
>> consumed at each device, and factor that into scheduling decision
>> (more discussion needed here)
>> 5 Support multiple layers of cgroups to align IO controller behavior
>> with CPU scheduling behavior (more discussion?)
>> 6 Incorporate an IO tracking approach which re-uses memory resource
>> controller code but is not dependent on it (may be biocgroup patches from
>> dm-ioband can be used here directly)
>> 7 Start an offline email thread to keep track of progress on the above
>> goals.
>>
>> Please feel free to add/modify items to the list
>> when you respond back. Any comments/suggestions are more than welcome.
>>
If we start with the bfq patches, this is how the plan would look:
1 Start with BFQ take 2.
2 Do the following to support proportional division:
a) Expose the per device weight interface to the user, instead of calculating
it from the priority.
b) Add support for disk time budgets, besides the sector budget that is
currently available (configurable option). (Fabio: Do you think we can just
emulate that using the existing code?) Another approach would be to give time
slices just like CFQ (discussing?)
4 Do the following to support the goals of 2 level schedulers:
a) Limit the request descriptors allocated to each cgroup by adding
functionality to elv_may_queue()
b) Add support for putting an absolute limit on IO consumed by a
cgroup. Such support is provided by Andrea
Righi's patches too.
c) Add support (configurable option) to keep track of the total disk
time/sectors/count consumed at each device, and factor that into the
scheduling decision (more discussion needed here)
6 Incorporate an IO tracking approach which re-uses memory resource
controller code but is not dependent on it (may be biocgroup patches from
dm-ioband can be used here directly)
7 Start an offline email thread to keep track of progress on the above
goals.
BFQ's support for a hierarchy of cgroups means that it's close to where
we want to get. Any comments on which approach looks better?
On Mon, Nov 17, 2008 at 6:02 PM, Li Zefan <[email protected]> wrote:
> Vivek Goyal wrote:
>> On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote:
>>> In an attempt to make sure that this discussion leads to
>>> something useful, we have summarized the points raised in this
>>> discussion and have come up with a strategy for future.
>>> The goal of this is to find common ground between all the approaches
>>> proposed on this mailing list.
>>>
>>> 1 Start with Satoshi's latest patches.
>>
>> I have had a brief look at both Satoshi's patch and bfq. I kind of like
>> bfq's patches for keeping track of per cgroup, per queue data structures.
>> May be we can look there also.
>>
>>> 2 Do the following to support propotional division:
>>> a) Give time slices in proportion to weights (configurable
>>> option). We can support both priorities and weights by doing
>>> propotional division between requests with same priorities.
>>> 3 Schedule time slices using WF2Q+ instead of round robin.
>>> Test the performance impact (both throughput and jitter in latency).
>>> 4 Do the following to support the goals of 2 level schedulers:
>>> a) Limit the request descriptors allocated to each cgroup by adding
>>> functionality to elv_may_queue()
>>> b) Add support for putting an absolute limit on IO consumed by a
>>> cgroup. Such support exists in dm-ioband and is provided by Andrea
>>> Righi's patches too.
>>
>> Does dm-iobnd support abosolute limit? I think till last version they did
>> not. I have not check the latest version though.
>>
>
> No, dm-ioband still provides weight/share control only. Only Andrea Righi's
> patches support absolute limit.
Thanks for the correction.
>
>>> c) Add support (configurable option) to keep track of total disk
>>> time/sectors/count
>>> consumed at each device, and factor that into scheduling decision
>>> (more discussion needed here)
>>> 5 Support multiple layers of cgroups to align IO controller behavior
>>> with CPU scheduling behavior (more discussion?)
>>> 6 Incorporate an IO tracking approach which re-uses memory resource
>>> controller code but is not dependent on it (may be biocgroup patches from
>>> dm-ioband can be used here directly)
>>> 7 Start an offline email thread to keep track of progress on the above
>>> goals.
>>>
>>> Please feel free to add/modify items to the list
>>> when you respond back. Any comments/suggestions are more than welcome.
>>>
>
>
Nauman Rafique wrote:
> If we start with bfq patches, this is how plan would look like:
>
> 1 Start with BFQ take 2.
> 2 Do the following to support proportional division:
> a) Expose the per device weight interface to user, instead of calculating
> from priority.
> b) Add support for disk time budgets, besides sector budget that is currently
> available (configurable option). (Fabio: Do you think we can just emulate
> that using the existing code?). Another approach would be to give time slices
> just like CFQ (discussing?)
> 4 Do the following to support the goals of 2 level schedulers:
> a) Limit the request descriptors allocated to each cgroup by adding
> functionality to elv_may_queue()
> b) Add support for putting an absolute limit on IO consumed by a
> cgroup. Such support is provided by Andrea
> Righi's patches too.
> c) Add support (configurable option) to keep track of total disk
> time/sectors/count
> consumed at each device, and factor that into scheduling decision
> (more discussion needed here)
> 6 Incorporate an IO tracking approach which re-uses memory resource
> controller code but is not dependent on it (may be biocgroup patches from
> dm-ioband can be used here directly)
The newest bio_cgroup doesn't use much memcg code, I think. The older biocgroup
tracked IO using mem_cgroup_charge(), and mem_cgroup_charge() remembers which
cgroup owns a struct page. But now biocgroup directly puts hooks into
__set_page_dirty() and some other places to track pages.
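To make that concrete, a rough sketch of the idea (not the actual bio-cgroup
code; the structure and helper names below are placeholders):

/*
 * Illustrative only -- not the real bio-cgroup patches.  The idea is to
 * remember, when a page is dirtied, which cgroup the dirtying task
 * belongs to, so the owner can be recovered later when the write is
 * actually submitted.  All names are placeholders.
 */
struct page_track {			/* stand-in for per-page tracking state */
	unsigned short blkio_id;	/* id of the cgroup that dirtied the page */
};

static inline void blkio_track_dirty(struct page_track *pt,
				     unsigned short dirtier_cgroup_id)
{
	/* called from a __set_page_dirty()-style hook */
	pt->blkio_id = dirtier_cgroup_id;
}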
> 7 Start an offline email thread to keep track of progress on the above
> goals.
>
> BFQ's support for hierarchy of cgroups means that its close to where
> we want to get. Any comments on what approach looks better?
>
Looks like a sane way :). We are also trying to keep track of the discussion and
development of the IO controller. I'll start having a look into BFQ.
> On Mon, Nov 17, 2008 at 6:02 PM, Li Zefan <[email protected]> wrote:
>> Vivek Goyal wrote:
>>> On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote:
>>>> In an attempt to make sure that this discussion leads to
>>>> something useful, we have summarized the points raised in this
>>>> discussion and have come up with a strategy for future.
>>>> The goal of this is to find common ground between all the approaches
>>>> proposed on this mailing list.
>>>>
>>>> 1 Start with Satoshi's latest patches.
>>> I have had a brief look at both Satoshi's patch and bfq. I kind of like
>>> bfq's patches for keeping track of per cgroup, per queue data structures.
>>> May be we can look there also.
>>>
>>>> 2 Do the following to support propotional division:
>>>> a) Give time slices in proportion to weights (configurable
>>>> option). We can support both priorities and weights by doing
>>>> propotional division between requests with same priorities.
>>>> 3 Schedule time slices using WF2Q+ instead of round robin.
>>>> Test the performance impact (both throughput and jitter in latency).
>>>> 4 Do the following to support the goals of 2 level schedulers:
>>>> a) Limit the request descriptors allocated to each cgroup by adding
>>>> functionality to elv_may_queue()
>>>> b) Add support for putting an absolute limit on IO consumed by a
>>>> cgroup. Such support exists in dm-ioband and is provided by Andrea
>>>> Righi's patches too.
>>> Does dm-iobnd support abosolute limit? I think till last version they did
>>> not. I have not check the latest version though.
>>>
>> No, dm-ioband still provides weight/share control only. Only Andrea Righi's
>> patches support absolute limit.
>
> Thanks for the correction.
>
>>>> c) Add support (configurable option) to keep track of total disk
>>>> time/sectors/count
>>>> consumed at each device, and factor that into scheduling decision
>>>> (more discussion needed here)
>>>> 5 Support multiple layers of cgroups to align IO controller behavior
>>>> with CPU scheduling behavior (more discussion?)
>>>> 6 Incorporate an IO tracking approach which re-uses memory resource
>>>> controller code but is not dependent on it (may be biocgroup patches from
>>>> dm-ioband can be used here directly)
>>>> 7 Start an offline email thread to keep track of progress on the above
>>>> goals.
>>>>
>>>> Please feel free to add/modify items to the list
>>>> when you respond back. Any comments/suggestions are more than welcome.
>>>>
>>
>
Hi,
> From: Nauman Rafique <[email protected]>
> Date: Mon, Nov 17, 2008 09:01:48PM -0800
>
> If we start with bfq patches, this is how plan would look like:
>
> 1 Start with BFQ take 2.
> 2 Do the following to support proportional division:
> a) Expose the per device weight interface to user, instead of calculating
> from priority.
> b) Add support for disk time budgets, besides sector budget that is currently
> available (configurable option). (Fabio: Do you think we can just emulate
> that using the existing code?). Another approach would be to give time slices
> just like CFQ (discussing?)
it should be possible without altering the code. The slices can be
assigned in the time domain using big values for max_budget. The logic
is: each process is assigned a budget (in the range [max_budget/2, max_budget],
chosen by the feedback mechanism, driven in __bfq_bfqq_recalc_budget()),
and if it does not complete it in timeout_sync milliseconds, it is
charged a fixed amount of sectors of service.
Using big values for max_budget (where big means greater than two
times the number of sectors the hard drive can transfer in timeout_sync
milliseconds) makes the budgets always time out, so the disk time
is scheduled in slices of timeout_sync.
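To make the "big value" concrete, a rough helper (illustrative only, not BFQ
code; the peak transfer rate of the drive is an assumed input):

/*
 * Illustrative helper, not part of BFQ: pick a max_budget so large that
 * every queue hits the timeout_sync timer before exhausting its sector
 * budget, so the disk is effectively scheduled in slices of timeout_sync.
 */
static inline unsigned long time_sliced_max_budget(unsigned long peak_sectors_per_ms,
						   unsigned long timeout_sync_ms)
{
	/* "big" = more than twice what the drive can move in timeout_sync */
	return 2 * peak_sectors_per_ms * timeout_sync_ms + 1;
}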
However this is just a temporary workaround to do some basic testing.
Modifying the scheduler to support time slices instead of sector
budgets would indeed simplify the code; I think that the drawback
would be being too unfair in the service domain. Of course we
have to consider how important it is to be fair in the service
domain, and how much added complexity/new code we can accept for it.
[ Better service domain fairness is one of the main reasons why
we started working on bfq, so, talking for me and Paolo it _is_
important :) ]
I have to think a little bit on how it would be possible to support
an option for time-only budgets, coexisting with the current behavior,
but I think it can be done.
> 4 Do the following to support the goals of 2 level schedulers:
> a) Limit the request descriptors allocated to each cgroup by adding
> functionality to elv_may_queue()
> b) Add support for putting an absolute limit on IO consumed by a
> cgroup. Such support is provided by Andrea
> Righi's patches too.
> c) Add support (configurable option) to keep track of total disk
> time/sectors/count
> consumed at each device, and factor that into scheduling decision
> (more discussion needed here)
> 6 Incorporate an IO tracking approach which re-uses memory resource
> controller code but is not dependent on it (may be biocgroup patches from
> dm-ioband can be used here directly)
> 7 Start an offline email thread to keep track of progress on the above
> goals.
>
> BFQ's support for hierarchy of cgroups means that its close to where
> we want to get. Any comments on what approach looks better?
>
The main problems with this approach (as with the cfq-based ones) in
my opinion are:
- the request descriptor allocation problem Divyesh talked about,
- the impossibility of respecting different weights, resulting from
the interlock problem with synchronous requests Vivek talked about
[ in cfq/bfq this can happen when idling is disabled, e.g., for
SSDs, or when using NCQ ],
but I think that correctly addressing your points 4.a) and 4.b) should
solve them.
On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> Hi,
>
> > From: Nauman Rafique <[email protected]>
> > Date: Mon, Nov 17, 2008 09:01:48PM -0800
> >
> > If we start with bfq patches, this is how plan would look like:
> >
> > 1 Start with BFQ take 2.
> > 2 Do the following to support proportional division:
> > a) Expose the per device weight interface to user, instead of calculating
> > from priority.
> > b) Add support for disk time budgets, besides sector budget that is currently
> > available (configurable option). (Fabio: Do you think we can just emulate
> > that using the existing code?). Another approach would be to give time slices
> > just like CFQ (discussing?)
>
> it should be possible without altering the code. The slices can be
> assigned in the time domain using big values for max_budget. The logic
> is: each process is assigned a budget (in the range [max_budget/2, max_budget],
> choosen from the feedback mechanism, driven in __bfq_bfqq_recalc_budget()),
> and if it does not complete it in timeout_sync milliseconds, it is
> charged a fixed amount of sectors of service.
>
> Using big values for max_budget (where big means greater than two
> times the number of sectors the hard drive can transfer in timeout_sync
> milliseconds) makes the budgets always to time out, so the disk time
> is scheduled in slices of timeout_sync.
>
> However this is just a temporary workaround to do some basic testing.
>
> Modifying the scheduler to support time slices instead of sector
> budgets would indeed simplify the code; I think that the drawback
> would be being too unfair in the service domain. Of course we
> have to consider how much is important to be fair in the service
> domain, and how much added complexity/new code can we accept for it.
>
> [ Better service domain fairness is one of the main reasons why
> we started working on bfq, so, talking for me and Paolo it _is_
> important :) ]
>
> I have to think a little bit on how it would be possible to support
> an option for time-only budgets, coexisting with the current behavior,
> but I think it can be done.
>
IIUC, bfq and cfq differ in the following ways:
a. BFQ employs WF2Q+ for fairness while CFQ employs weighted round robin.
b. BFQ uses the budget (sector count) as the notion of service while CFQ uses
time slices.
c. BFQ supports hierarchical fair queuing and CFQ does not.
We are looking forward to the implementation of point C. Fabio seems to be
thinking of supporting time slices as the notion of service (B). That looks like
a convergence of CFQ and BFQ except for point A (WF2Q+ vs weighted round
robin).
It looks like WF2Q+ provides a tighter service bound, and the bfq guys mention
that they have been able to maintain throughput while ensuring those tighter
bounds. If that's the case, does that mean BFQ is a replacement for CFQ
down the line?
Thanks
Vivek
> From: Vivek Goyal <[email protected]>
> Date: Tue, Nov 18, 2008 09:07:51AM -0500
>
> On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
...
> > I have to think a little bit on how it would be possible to support
> > an option for time-only budgets, coexisting with the current behavior,
> > but I think it can be done.
> >
>
> IIUC, bfq and cfq are different in following manner.
>
> a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
> b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> time slices.
> c. BFQ supports hierarchical fair queuing and CFQ does not.
>
> We are looking forward for implementation of point C. Fabio seems to
> thinking of supporting time slice as a service (B). It seems like
> convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> robin).
>
> It looks like WF2Q+ provides tighter service bound and bfq guys mention
> that they have been able to ensure throughput while ensuring tighter
> bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> down the line?
>
BFQ started from CFQ, extending it in the way you correctly describe,
so it is indeed very similar. There are also some minor changes to
locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
The two schedulers share similar goals, and in my opinion BFQ can be
considered, in the long term, a CFQ replacement; *but* before talking
about replacing CFQ we have to consider that:
- it *needs* review and testing; we've done our best, but for sure
it's not enough; review and testing are never enough;
- the service domain fairness, which was one of our objectives, requires
some extra complexity; the mechanisms we used and the design choices
we've made may not fit all the needs, or may not be as generic as the
simpler CFQ's ones;
- CFQ has years of history behind and has been tuned for a wider
variety of environments than the ones we've been able to test.
If time-based fairness is considered more robust and the loss of
service-domain fairness is not a problem, then the two schedulers can
be made even more similar.
On Tue, Nov 18 2008, Fabio Checconi wrote:
> > From: Vivek Goyal <[email protected]>
> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> >
> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> ...
> > > I have to think a little bit on how it would be possible to support
> > > an option for time-only budgets, coexisting with the current behavior,
> > > but I think it can be done.
> > >
> >
> > IIUC, bfq and cfq are different in following manner.
> >
> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> > time slices.
> > c. BFQ supports hierarchical fair queuing and CFQ does not.
> >
> > We are looking forward for implementation of point C. Fabio seems to
> > thinking of supporting time slice as a service (B). It seems like
> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> > robin).
> >
> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> > that they have been able to ensure throughput while ensuring tighter
> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> > down the line?
> >
>
> BFQ started from CFQ, extending it in the way you correctly describe,
> so it is indeed very similar. There are also some minor changes to
> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
>
> The two schedulers share similar goals, and in my opinion BFQ can be
> considered, in the long term, a CFQ replacement; *but* before talking
> about replacing CFQ we have to consider that:
>
> - it *needs* review and testing; we've done our best, but for sure
> it's not enough; review and testing are never enough;
> - the service domain fairness, which was one of our objectives, requires
> some extra complexity; the mechanisms we used and the design choices
> we've made may not fit all the needs, or may not be as generic as the
> simpler CFQ's ones;
> - CFQ has years of history behind and has been tuned for a wider
> variety of environments than the ones we've been able to test.
>
> If time-based fairness is considered more robust and the loss of
> service-domain fairness is not a problem, then the two schedulers can
> be made even more similar.
My preferred approach here would be, in order of TODO:
- Create and test the smallish patches for seekiness, hw_tag checking,
and so on for CFQ.
- Create and test a WF2Q+ service dispatching patch for CFQ.
and if there are leftovers after that, we could even conditionally
enable some of those if appropriate. I think the WF2Q+ is quite cool and
could be easily usable as the default, so it's definitely a viable
alternative.
My main goal here is basically avoiding addition of Yet Another IO
scheduler, especially one that is so closely tied to CFQ already.
I'll start things off by splitting cfq into a few files similar to what
bfq has done, as I think it makes a lot of sense. Fabio, if you could
create patches for the small behavioural changes you made, we can
discuss and hopefully merge those next.
--
Jens Axboe
On Tue, Nov 18, 2008 at 08:12:08PM +0100, Jens Axboe wrote:
> On Tue, Nov 18 2008, Fabio Checconi wrote:
> > > From: Vivek Goyal <[email protected]>
> > > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> > >
> > > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> > ...
> > > > I have to think a little bit on how it would be possible to support
> > > > an option for time-only budgets, coexisting with the current behavior,
> > > > but I think it can be done.
> > > >
> > >
> > > IIUC, bfq and cfq are different in following manner.
> > >
> > > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
> > > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> > > time slices.
> > > c. BFQ supports hierarchical fair queuing and CFQ does not.
> > >
> > > We are looking forward for implementation of point C. Fabio seems to
> > > thinking of supporting time slice as a service (B). It seems like
> > > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> > > robin).
> > >
> > > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> > > that they have been able to ensure throughput while ensuring tighter
> > > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> > > down the line?
> > >
> >
> > BFQ started from CFQ, extending it in the way you correctly describe,
> > so it is indeed very similar. There are also some minor changes to
> > locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
> >
> > The two schedulers share similar goals, and in my opinion BFQ can be
> > considered, in the long term, a CFQ replacement; *but* before talking
> > about replacing CFQ we have to consider that:
> >
> > - it *needs* review and testing; we've done our best, but for sure
> > it's not enough; review and testing are never enough;
> > - the service domain fairness, which was one of our objectives, requires
> > some extra complexity; the mechanisms we used and the design choices
> > we've made may not fit all the needs, or may not be as generic as the
> > simpler CFQ's ones;
> > - CFQ has years of history behind and has been tuned for a wider
> > variety of environments than the ones we've been able to test.
> >
> > If time-based fairness is considered more robust and the loss of
> > service-domain fairness is not a problem, then the two schedulers can
> > be made even more similar.
>
> My preferred approach here would be, in order or TODO:
>
> - Create and test the smallish patches for seekiness, hw_tag checking,
> and so on for CFQ.
> - Create and test a WF2Q+ service dispatching patch for CFQ.
>
Hi Jens,
What do you think about the "hierarchical"/cgroup part of the BFQ patch? Do
you intend to incorporate/include that piece as well, or do you think that's
not the way to go for the IO controller stuff?
Thanks
Vivek
> and if there are leftovers after that, we could even conditionally
> enable some of those if appropriate. I think the WF2Q+ is quite cool and
> could be easily usable as the default, so it's definitely a viable
> alternative.
>
> My main goal here is basically avoiding addition of Yet Another IO
> scheduler, especially one that is so closely tied to CFQ already.
>
> I'll start things off by splitting cfq into a few files similar to what
> bfq has done, as I think it makes a lot of sense. Fabio, if you could
> create patches for the small behavioural changes you made, we can
> discuss and hopefully merge those next.
>
> --
> Jens Axboe
> From: Jens Axboe <[email protected]>
> Date: Tue, Nov 18, 2008 08:12:08PM +0100
>
> On Tue, Nov 18 2008, Fabio Checconi wrote:
> > BFQ started from CFQ, extending it in the way you correctly describe,
> > so it is indeed very similar. There are also some minor changes to
> > locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
> >
...
> My preferred approach here would be, in order or TODO:
>
> - Create and test the smallish patches for seekiness, hw_tag checking,
> and so on for CFQ.
> - Create and test a WF2Q+ service dispatching patch for CFQ.
>
> and if there are leftovers after that, we could even conditionally
> enable some of those if appropriate. I think the WF2Q+ is quite cool and
> could be easily usable as the default, so it's definitely a viable
> alternative.
>
> My main goal here is basically avoiding addition of Yet Another IO
> scheduler, especially one that is so closely tied to CFQ already.
>
> I'll start things off by splitting cfq into a few files similar to what
> bfq has done, as I think it makes a lot of sense. Fabio, if you could
> create patches for the small behavioural changes you made, we can
> discuss and hopefully merge those next.
>
Ok, I can do that, I need just a little bit of time to organize
the work.
About these small (some of them are really small) changes, a mixed list
of things that they will touch and/or things that I'd like to have clear
before starting to write the patches (maybe we can start another thread
for them):
- In cfq_exit_single_io_context() and in changed_ioprio(), cic->key
is dereferenced without holding any lock. As I reported in [1]
this seems to be a problem when an exit() races with a cfq_exit_queue()
and in a few other cases. In BFQ we used a somewhat involved
mechanism to avoid that, abusing rcu (of course we'll have to wait
the patch to talk about it :) ), but given my lack of understanding
of some parts of the block layer, I'd be interested in knowing if
the race is possible and/or if there is something more involved
going on that can cause the same effects.
- set_task_ioprio() in fs/ioprio.c doesn't seem to have a write
memory barrier to pair with the dependent read one in
cfq_get_io_context() (a schematic of the pairing follows this list).
- CFQ_MIN_TT is 2ms; depending on the value of HZ this can result
in timeouts of one jiffy, which may expire too early, so we are
just wasting time and do not actually wait for the task to present
its new request. Dealing with seeky traffic we've seen a lot of
early timeouts due to one-jiffy timers expiring too early; is
it worth fixing or can we live with that?
- To detect hw tagging in BFQ we consider a sample valid iff the
number of requests that the scheduler could have dispatched (given
by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
the scheduler plus the ones into the driver) is higher than the
CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems
during testing, but the way CFQ does it now seems a little bit
strange.
- Initially, cic->last_request_pos is zero, so the sdist charged
to a task for its first seek depends on the position on the disk
that is accessed first, independently of its seekiness. Even
if there is a cap on that value, we chose not to charge the first
seek to processes; that resulted in fewer wrong predictions for
purely sequential loads.
- From my understanding, with shared I/O contexts, two different
tasks may concurrently look up a cfqd in the same ioc.
This may result in cfq_drop_dead_cic() being called two times
for the same cic. Am I missing something that prevents that from
happening?
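On the set_task_ioprio() barrier item above, this is the kind of pairing
meant, just as a schematic (the struct and field names are placeholders,
not the real io_context layout; smp_wmb()/smp_rmb() are the usual kernel
primitives):

struct ioc_like {
	unsigned short ioprio;
	int ioprio_changed;
};

/* writer side, e.g. a set_task_ioprio()-like path */
static void publish_new_prio(struct ioc_like *ioc, unsigned short prio)
{
	ioc->ioprio = prio;
	smp_wmb();			/* make the new ioprio visible... */
	ioc->ioprio_changed = 1;	/* ...before the "changed" flag */
}

/* reader side, e.g. a cfq_get_io_context()-like path */
static unsigned short read_prio(struct ioc_like *ioc, unsigned short cached)
{
	if (ioc->ioprio_changed) {
		smp_rmb();		/* pairs with the writer's smp_wmb() */
		cached = ioc->ioprio;
		ioc->ioprio_changed = 0;
	}
	return cached;
}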
Regarding the code splitup, do you think you'll go for the CFS(BFQ) way,
using a single compilation unit and including the .c files, or a layout
with different compilation units (like the ll_rw_blk.c splitup)?
[1]: http://lkml.org/lkml/2008/8/18/119
On Mon, Nov 17, 2008 at 11:42 PM, Li Zefan <[email protected]> wrote:
> Nauman Rafique wrote:
>> If we start with bfq patches, this is how plan would look like:
>>
>> 1 Start with BFQ take 2.
>> 2 Do the following to support proportional division:
>> a) Expose the per device weight interface to user, instead of calculating
>> from priority.
>> b) Add support for disk time budgets, besides sector budget that is currently
>> available (configurable option). (Fabio: Do you think we can just emulate
>> that using the existing code?). Another approach would be to give time slices
>> just like CFQ (discussing?)
>> 4 Do the following to support the goals of 2 level schedulers:
>> a) Limit the request descriptors allocated to each cgroup by adding
>> functionality to elv_may_queue()
>> b) Add support for putting an absolute limit on IO consumed by a
>> cgroup. Such support is provided by Andrea
>> Righi's patches too.
>> c) Add support (configurable option) to keep track of total disk
>> time/sectors/count
>> consumed at each device, and factor that into scheduling decision
>> (more discussion needed here)
>> 6 Incorporate an IO tracking approach which re-uses memory resource
>> controller code but is not dependent on it (may be biocgroup patches from
>> dm-ioband can be used here directly)
>
> The newest bio_cgroup doesn't use much memcg code I think. The older biocgroup
> tracks IO using mem_cgroup_charge(), and mem_cgroup_charge() remembers a struct page
> owns by which cgroup. But now biocgroup changes to directly put some hooks in
> __set_page_dirty() and some other places to track pages.
I did not look into the latest biocgroup patches, so maybe you are right.
Nevertheless, bfq currently gets cgroup info out of the io context and so
would handle only synchronous reads. For this action item, we have to
make the latest biocgroup patches work with bfq.
>
>> 7 Start an offline email thread to keep track of progress on the above
>> goals.
>>
>> BFQ's support for hierarchy of cgroups means that its close to where
>> we want to get. Any comments on what approach looks better?
>>
>
> Looks like a sane way :) . We are also trying to keep track of the discussion and
> development of IO controller. I'll start to have a look into BFQ.
>
>> On Mon, Nov 17, 2008 at 6:02 PM, Li Zefan <[email protected]> wrote:
>>> Vivek Goyal wrote:
>>>> On Fri, Nov 14, 2008 at 02:44:22PM -0800, Nauman Rafique wrote:
>>>>> In an attempt to make sure that this discussion leads to
>>>>> something useful, we have summarized the points raised in this
>>>>> discussion and have come up with a strategy for future.
>>>>> The goal of this is to find common ground between all the approaches
>>>>> proposed on this mailing list.
>>>>>
>>>>> 1 Start with Satoshi's latest patches.
>>>> I have had a brief look at both Satoshi's patch and bfq. I kind of like
>>>> bfq's patches for keeping track of per cgroup, per queue data structures.
>>>> May be we can look there also.
>>>>
>>>>> 2 Do the following to support propotional division:
>>>>> a) Give time slices in proportion to weights (configurable
>>>>> option). We can support both priorities and weights by doing
>>>>> propotional division between requests with same priorities.
>>>>> 3 Schedule time slices using WF2Q+ instead of round robin.
>>>>> Test the performance impact (both throughput and jitter in latency).
>>>>> 4 Do the following to support the goals of 2 level schedulers:
>>>>> a) Limit the request descriptors allocated to each cgroup by adding
>>>>> functionality to elv_may_queue()
>>>>> b) Add support for putting an absolute limit on IO consumed by a
>>>>> cgroup. Such support exists in dm-ioband and is provided by Andrea
>>>>> Righi's patches too.
>>>> Does dm-iobnd support abosolute limit? I think till last version they did
>>>> not. I have not check the latest version though.
>>>>
>>> No, dm-ioband still provides weight/share control only. Only Andrea Righi's
>>> patches support absolute limit.
>>
>> Thanks for the correction.
>>
>>>>> c) Add support (configurable option) to keep track of total disk
>>>>> time/sectors/count
>>>>> consumed at each device, and factor that into scheduling decision
>>>>> (more discussion needed here)
>>>>> 5 Support multiple layers of cgroups to align IO controller behavior
>>>>> with CPU scheduling behavior (more discussion?)
>>>>> 6 Incorporate an IO tracking approach which re-uses memory resource
>>>>> controller code but is not dependent on it (may be biocgroup patches from
>>>>> dm-ioband can be used here directly)
>>>>> 7 Start an offline email thread to keep track of progress on the above
>>>>> goals.
>>>>>
>>>>> Please feel free to add/modify items to the list
>>>>> when you respond back. Any comments/suggestions are more than welcome.
>>>>>
>>>
>>
>
On Tue, Nov 18, 2008 at 4:05 AM, Fabio Checconi <[email protected]> wrote:
> Hi,
>
>> From: Nauman Rafique <[email protected]>
>> Date: Mon, Nov 17, 2008 09:01:48PM -0800
>>
>> If we start with bfq patches, this is how plan would look like:
>>
>> 1 Start with BFQ take 2.
>> 2 Do the following to support proportional division:
>> a) Expose the per device weight interface to user, instead of calculating
>> from priority.
>> b) Add support for disk time budgets, besides sector budget that is currently
>> available (configurable option). (Fabio: Do you think we can just emulate
>> that using the existing code?). Another approach would be to give time slices
>> just like CFQ (discussing?)
>
> it should be possible without altering the code. The slices can be
> assigned in the time domain using big values for max_budget. The logic
> is: each process is assigned a budget (in the range [max_budget/2, max_budget],
> choosen from the feedback mechanism, driven in __bfq_bfqq_recalc_budget()),
> and if it does not complete it in timeout_sync milliseconds, it is
> charged a fixed amount of sectors of service.
>
> Using big values for max_budget (where big means greater than two
> times the number of sectors the hard drive can transfer in timeout_sync
> milliseconds) makes the budgets always to time out, so the disk time
> is scheduled in slices of timeout_sync.
>
> However this is just a temporary workaround to do some basic testing.
>
> Modifying the scheduler to support time slices instead of sector
> budgets would indeed simplify the code; I think that the drawback
> would be being too unfair in the service domain. Of course we
> have to consider how much is important to be fair in the service
> domain, and how much added complexity/new code can we accept for it.
>
> [ Better service domain fairness is one of the main reasons why
> we started working on bfq, so, talking for me and Paolo it _is_
> important :) ]
>
> I have to think a little bit on how it would be possible to support
> an option for time-only budgets, coexisting with the current behavior,
> but I think it can be done.
I think "time only budget" vs "sector budget" is dependent on the
definition of fairness: do you want to be fair in the time that is
given to each cgroup or fair in total number of sectors transferred.
And the appropriate definition of fairness depends on how/where the IO
scheduler is used. Do you think the work-around that you mentioned
would have a significant performance difference compared to direct
built-in support?
>
>
>> 4 Do the following to support the goals of 2 level schedulers:
>> a) Limit the request descriptors allocated to each cgroup by adding
>> functionality to elv_may_queue()
>> b) Add support for putting an absolute limit on IO consumed by a
>> cgroup. Such support is provided by Andrea
>> Righi's patches too.
>> c) Add support (configurable option) to keep track of total disk
>> time/sectors/count
>> consumed at each device, and factor that into scheduling decision
>> (more discussion needed here)
>> 6 Incorporate an IO tracking approach which re-uses memory resource
>> controller code but is not dependent on it (may be biocgroup patches from
>> dm-ioband can be used here directly)
>> 7 Start an offline email thread to keep track of progress on the above
>> goals.
>>
>> BFQ's support for hierarchy of cgroups means that its close to where
>> we want to get. Any comments on what approach looks better?
>>
>
> The main problems with this approach (as with the cfq-based ones) in
> my opinion are:
> - the request descriptor allocation problem Divyesh talked about,
> - the impossibility of respecting different weights, resulting from
> the interlock problem with synchronous requests Vivek talked about
> [ in cfq/bfq this can happen when idling is disabled, e.g., for
> SSDs, or when using NCQ ],
>
> but I think that correctly addressing your points 4.a) and 4.b) should
> solve them.
>
On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <[email protected]> wrote:
> On Tue, Nov 18 2008, Fabio Checconi wrote:
>> > From: Vivek Goyal <[email protected]>
>> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
>> >
>> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
>> ...
>> > > I have to think a little bit on how it would be possible to support
>> > > an option for time-only budgets, coexisting with the current behavior,
>> > > but I think it can be done.
>> > >
>> >
>> > IIUC, bfq and cfq are different in following manner.
>> >
>> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
>> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
>> > time slices.
>> > c. BFQ supports hierarchical fair queuing and CFQ does not.
>> >
>> > We are looking forward for implementation of point C. Fabio seems to
>> > thinking of supporting time slice as a service (B). It seems like
>> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
>> > robin).
>> >
>> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
>> > that they have been able to ensure throughput while ensuring tighter
>> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
>> > down the line?
>> >
>>
>> BFQ started from CFQ, extending it in the way you correctly describe,
>> so it is indeed very similar. There are also some minor changes to
>> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
>>
>> The two schedulers share similar goals, and in my opinion BFQ can be
>> considered, in the long term, a CFQ replacement; *but* before talking
>> about replacing CFQ we have to consider that:
>>
>> - it *needs* review and testing; we've done our best, but for sure
>> it's not enough; review and testing are never enough;
>> - the service domain fairness, which was one of our objectives, requires
>> some extra complexity; the mechanisms we used and the design choices
>> we've made may not fit all the needs, or may not be as generic as the
>> simpler CFQ's ones;
>> - CFQ has years of history behind and has been tuned for a wider
>> variety of environments than the ones we've been able to test.
>>
>> If time-based fairness is considered more robust and the loss of
>> service-domain fairness is not a problem, then the two schedulers can
>> be made even more similar.
>
> My preferred approach here would be, in order or TODO:
>
> - Create and test the smallish patches for seekiness, hw_tag checking,
> and so on for CFQ.
> - Create and test a WF2Q+ service dispatching patch for CFQ.
>
> and if there are leftovers after that, we could even conditionally
> enable some of those if appropriate. I think the WF2Q+ is quite cool and
> could be easily usable as the default, so it's definitely a viable
> alternative.
1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would
result in time slices being scheduled using WF2Q+
2 Do the following to support proportional division:
a) Expose the per device weight interface to user, instead of calculating
from priority.
b) Add support for scheduling bandwidth among a hierarchy of cgroups
(besides threads)
3 Do the following to support the goals of 2 level schedulers:
a) Limit the request descriptors allocated to each cgroup by adding
functionality to elv_may_queue()
b) Add support for putting an absolute limit on IO consumed by a
cgroup. Such support is provided by Andrea
Righi's patches too.
c) Add support (configurable option) to keep track of total disk
time/sectors/count
consumed at each device, and factor that into scheduling decision
(more discussion needed here)
6 Incorporate an IO tracking approach which can allow tracking cgroups
for asynchronous reads/writes.
7 Start an offline email thread to keep track of progress on the above
goals.
Jens, what is your opinion on everything beyond (1) in the above list?
It would be great if work on (1) and (2)-(7) can happen in parallel so
that we can see "proportional division of IO bandwidth to cgroups" in the
tree sooner rather than later.
>
> My main goal here is basically avoiding addition of Yet Another IO
> scheduler, especially one that is so closely tied to CFQ already.
>
> I'll start things off by splitting cfq into a few files similar to what
> bfq has done, as I think it makes a lot of sense. Fabio, if you could
> create patches for the small behavioural changes you made, we can
> discuss and hopefully merge those next.
>
> --
> Jens Axboe
>
>
> From: Nauman Rafique <[email protected]>
> Date: Tue, Nov 18, 2008 02:33:19PM -0800
>
> On Tue, Nov 18, 2008 at 4:05 AM, Fabio Checconi <[email protected]> wrote:
...
> > it should be possible without altering the code. The slices can be
> > assigned in the time domain using big values for max_budget. The logic
> > is: each process is assigned a budget (in the range [max_budget/2, max_budget],
> > choosen from the feedback mechanism, driven in __bfq_bfqq_recalc_budget()),
> > and if it does not complete it in timeout_sync milliseconds, it is
> > charged a fixed amount of sectors of service.
> >
> > Using big values for max_budget (where big means greater than two
> > times the number of sectors the hard drive can transfer in timeout_sync
> > milliseconds) makes the budgets always to time out, so the disk time
> > is scheduled in slices of timeout_sync.
> >
> > However this is just a temporary workaround to do some basic testing.
> >
> > Modifying the scheduler to support time slices instead of sector
> > budgets would indeed simplify the code; I think that the drawback
> > would be being too unfair in the service domain. Of course we
> > have to consider how much is important to be fair in the service
> > domain, and how much added complexity/new code can we accept for it.
> >
> > [ Better service domain fairness is one of the main reasons why
> > we started working on bfq, so, talking for me and Paolo it _is_
> > important :) ]
> >
> > I have to think a little bit on how it would be possible to support
> > an option for time-only budgets, coexisting with the current behavior,
> > but I think it can be done.
>
> I think "time only budget" vs "sector budget" is dependent on the
> definition of fairness: do you want to be fair in the time that is
> given to each cgroup or fair in total number of sectors transferred.
> And the appropriate definition of fairness depends on how/where the IO
> scheduler is used. Do you think the work-around that you mentioned
> would have a significant performance difference compared to direct
> built-in support?
>
In terms of throughput, it should not have any influence, since tasks
would always receive a full timeslice. In terms of latency it would
completely bypass the feedback mechanism, and that would have a
negative impact (basically the scheduler would not be able to
differentiate between tasks with the same weight but with different
interactivity needs).
In terms of service fairness it is a little bit hard to say, but I
would not expect anything close to what can be done with a service-domain
approach, independently of the scheduler used.
Fabio Checconi wrote:
> - To detect hw tagging in BFQ we consider a sample valid iff the
> number of requests that the scheduler could have dispatched (given
> by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
> the scheduler plus the ones into the driver) is higher than the
> CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems
> during testing, but the way CFQ uses now seems a little bit
> strange.
BFQ's tag detection logic is broken in the same way that CFQ's used to
be. Explanation is in this patch:
============================x8============================
commit 45333d5a31296d0af886d94f1d08f128231cab8e
Author: Aaron Carroll <[email protected]>
Date: Tue Aug 26 15:52:36 2008 +0200
cfq-iosched: fix queue depth detection
CFQ's detection of queueing devices assumes a non-queuing device and detects
if the queue depth reaches a certain threshold. Under some workloads (e.g.
synchronous reads), CFQ effectively forces a unit queue depth, thus defeating
the detection logic. This leads to poor performance on queuing hardware,
since the idle window remains enabled.
This patch inverts the sense of the logic: assume a queuing-capable device,
and detect if the depth does not exceed the threshold.
============================x8=============================
BFQ seems better than CFQ at avoiding this problem though. Using the following fio
job, I can routinely trigger it for 10s or so before BFQ detects queuing.
============================x8=============================
[global]
direct=1
ioengine=sync
norandommap
randrepeat=0
filename=/dev/sdb
bs=16k
runtime=200
time_based
[reader]
rw=randread
numjobs=128
============================x8=============================
Nauman Rafique ha scritto:
>
>
> I think "time only budget" vs "sector budget" is dependent on the
> definition of fairness: do you want to be fair in the time that is
> given to each cgroup or fair in total number of sectors transferred.
> And the appropriate definition of fairness depends on how/where the IO
> scheduler is used. ...
>
>
Just a general note: as Fabio already said, switching back to time
budgets in BFQ would be (conceptually) straightforward.
However, we will never get fairness in bandwidth distribution if we work
(only) in the time domain.
--
-----------------------------------------------------------
| Paolo Valente | |
| Algogroup | |
| Dip. Ing. Informazione | tel: +39 059 2056318 |
| Via Vignolese 905/b | fax: +39 059 2056199 |
| 41100 Modena | |
| home: http://algo.ing.unimo.it/people/paolo/ |
-----------------------------------------------------------
> From: Aaron Carroll <[email protected]>
> Date: Wed, Nov 19, 2008 12:52:42PM +1100
>
> Fabio Checconi wrote:
> > - To detect hw tagging in BFQ we consider a sample valid iff the
> > number of requests that the scheduler could have dispatched (given
> > by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
> > the scheduler plus the ones into the driver) is higher than the
> > CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems
> > during testing, but the way CFQ uses now seems a little bit
> > strange.
>
> BFQ's tag detection logic is broken in the same way that CFQ's used to
> be. Explanation is in this patch:
>
If you look at bfq_update_hw_tag(), the logic introduced by the patch
you mention is still there; BFQ starts with ->hw_tag = 1, and updates it
every 32 valid samples. What changed WRT your patch, apart from the
number of samples, is that the condition for a sample to be valid is:
bfqd->rq_in_driver + bfqd->queued >= 5
while in your patch it is:
cfqd->rq_queued > 5 || cfqd->rq_in_driver > 5
We preferred the first one because that sum better reflects the number
of requests that could have been dispatched, and I don't think that this
is wrong.
There is a problem, but it's not within the tag detection logic itself.
From some quick experiments, what happens is that when a process starts,
CFQ considers it seeky (*), BFQ doesn't. As a side effect BFQ does not
always dispatch enough requests to correctly detect tagging.
At the first seek you cannot tell if the process is going to be seeky
or not, and we have chosen to consider it sequential because it improved
fairness in some sequential workloads (the CIC_SEEKY heuristic is used
also to determine the idle_window length in [bc]fq_arm_slice_timer()).
Anyway, we're dealing with heuristics, and they tend to favor some
workload over others. If recovering this throughput loss is more
important than a transient unfairness due to short idling windows assigned
to sequential processes when they start, I've no problems in switching
the CIC_SEEKY logic to consider a process seeky when it starts.
Thank you for testing and for pointing out this issue, we missed it
in our testing.
(*) to be correct, the initial classification depends on the position
of the first accessed sector.
> From: Fabio Checconi <[email protected]>
> Date: Wed, Nov 19, 2008 11:17:01AM +0100
>
> > From: Aaron Carroll <[email protected]>
> > Date: Wed, Nov 19, 2008 12:52:42PM +1100
> >
> > Fabio Checconi wrote:
> > > - To detect hw tagging in BFQ we consider a sample valid iff the
> > > number of requests that the scheduler could have dispatched (given
> > > by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
> > > the scheduler plus the ones into the driver) is higher than the
> > > CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems
> > > during testing, but the way CFQ uses now seems a little bit
> > > strange.
> >
> > BFQ's tag detection logic is broken in the same way that CFQ's used to
> > be. Explanation is in this patch:
> >
>
> If you look at bfq_update_hw_tag(), the logic introduced by the patch
> you mention is still there; BFQ starts with ->hw_tag = 1, and updates it
> every 32 valid samples. What changed WRT your patch, apart from the
> number of samples, is that the condition for a sample to be valid is:
>
> bfqd->rq_in_driver + bfqd->queued >= 5
>
> while in your patch it is:
>
> cfqd->rq_queued > 5 || cfqd->rq_in_driver > 5
>
> We preferred the first one because that sum better reflects the number
> of requests that could have been dispatched, and I don't think that this
> is wrong.
>
> There is a problem, but it's not within the tag detection logic itself.
> From some quick experiments, what happens is that when a process starts,
> CFQ considers it seeky (*), BFQ doesn't. As a side effect BFQ does not
> always dispatch enough requests to correctly detect tagging.
>
> At the first seek you cannot tell if the process is going to bee seeky
> or not, and we have chosen to consider it sequential because it improved
> fairness in some sequential workloads (the CIC_SEEKY heuristic is used
> also to determine the idle_window length in [bc]fq_arm_slice_timer()).
>
> Anyway, we're dealing with heuristics, and they tend to favor some
> workload over other ones. If recovering this thoughput loss is more
> important than a transient unfairness due to short idling windows assigned
> to sequential processes when they start, I've no problems in switching
> the CIC_SEEKY logic to consider a process seeky when it starts.
>
> Thank you for testing and for pointing out this issue, we missed it
> in our testing.
>
>
> (*) to be correct, the initial classification depends on the position
> of the first accessed sector.
Sorry, I forgot the patch... This seems to solve the problem with
your workload here, does it work for you?
[ The magic number would not appear in a definitive fix... ]
---
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 83e90e9..e9b010f 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -1322,10 +1322,12 @@ static void bfq_update_io_seektime(struct bfq_data *bfqd,
 	/*
 	 * Don't allow the seek distance to get too large from the
-	 * odd fragment, pagein, etc.
+	 * odd fragment, pagein, etc. The first request is not
+	 * really a seek, but we consider a cic seeky on creation
+	 * to make the hw_tag detection logic work better.
 	 */
-	if (cic->seek_samples == 0) /* first request, not really a seek */
-		sdist = 0;
+	if (cic->seek_samples == 0)
+		sdist = 8 * 1024 + 1;
 	else if (cic->seek_samples <= 60) /* second&third seek */
 		sdist = min(sdist, (cic->seek_mean * 4) + 2*1024*1024);
 	else
On Tue, Nov 18 2008, Nauman Rafique wrote:
> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <[email protected]> wrote:
> > On Tue, Nov 18 2008, Fabio Checconi wrote:
> >> > From: Vivek Goyal <[email protected]>
> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> >> >
> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> >> ...
> >> > > I have to think a little bit on how it would be possible to support
> >> > > an option for time-only budgets, coexisting with the current behavior,
> >> > > but I think it can be done.
> >> > >
> >> >
> >> > IIUC, bfq and cfq are different in following manner.
> >> >
> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> >> > time slices.
> >> > c. BFQ supports hierarchical fair queuing and CFQ does not.
> >> >
> >> > We are looking forward for implementation of point C. Fabio seems to
> >> > thinking of supporting time slice as a service (B). It seems like
> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> >> > robin).
> >> >
> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> >> > that they have been able to ensure throughput while ensuring tighter
> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> >> > down the line?
> >> >
> >>
> >> BFQ started from CFQ, extending it in the way you correctly describe,
> >> so it is indeed very similar. There are also some minor changes to
> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
> >>
> >> The two schedulers share similar goals, and in my opinion BFQ can be
> >> considered, in the long term, a CFQ replacement; *but* before talking
> >> about replacing CFQ we have to consider that:
> >>
> >> - it *needs* review and testing; we've done our best, but for sure
> >> it's not enough; review and testing are never enough;
> >> - the service domain fairness, which was one of our objectives, requires
> >> some extra complexity; the mechanisms we used and the design choices
> >> we've made may not fit all the needs, or may not be as generic as the
> >> simpler CFQ's ones;
> >> - CFQ has years of history behind and has been tuned for a wider
> >> variety of environments than the ones we've been able to test.
> >>
> >> If time-based fairness is considered more robust and the loss of
> >> service-domain fairness is not a problem, then the two schedulers can
> >> be made even more similar.
> >
> > My preferred approach here would be, in order or TODO:
> >
> > - Create and test the smallish patches for seekiness, hw_tag checking,
> > and so on for CFQ.
> > - Create and test a WF2Q+ service dispatching patch for CFQ.
> >
> > and if there are leftovers after that, we could even conditionally
> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
> > could be easily usable as the default, so it's definitely a viable
> > alternative.
>
> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would
> result in time slices being scheduled using WF2Q+
Yep, at least that is my preference.
> 2 Do the following to support proportional division:
> a) Expose the per device weight interface to user, instead of calculating
> from priority.
> b) Add support for scheduling bandwidth among a hierarchy of cgroups
> (besides threads)
> 3 Do the following to support the goals of 2 level schedulers:
> a) Limit the request descriptors allocated to each cgroup by adding
> functionality to elv_may_queue()
> b) Add support for putting an absolute limit on IO consumed by a
> cgroup. Such support is provided by Andrea
> Righi's patches too.
> c) Add support (configurable option) to keep track of total disk
> time/sectors/count
> consumed at each device, and factor that into scheduling decision
> (more discussion needed here)
> 6 Incorporate an IO tracking approach which can allow tracking cgroups
> for asynchronous reads/writes.
> 7 Start an offline email thread to keep track of progress on the above
> goals.
>
> Jens, what is your opinion everything beyond (1) in the above list?
>
> It would be great if work on (1) and (2)-(7) can happen in parallel so
> that we can see "proportional division of IO bandwidth to cgroups" in
> tree sooner than later.
Sounds feasible, I'd like to see the cgroups approach get more traction.
My primary concern is just that I don't want to merge it into specific
IO schedulers. As you mention, we can hook into the may queue logic for
that subset of the problem; that avoids touching the io scheduler. If we
can get this supported 'generically', then I'd be happy to help out.
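As a rough sketch of what such a generic hook could check (illustrative
only; io_cgroup, iocg_from_task() and the counters are made up for
illustration, only the ELV_MQUEUE_* return values come from the existing
elevator interface):

/*
 * Hypothetical per-cgroup request-descriptor limit enforced from a
 * may_queue-style hook, so that individual IO schedulers need not be
 * touched.  io_cgroup and iocg_from_task() do not exist; they are
 * placeholders for whatever cgroup lookup ends up being used.
 */
struct io_cgroup {
	unsigned int nr_requests;	/* descriptors currently held */
	unsigned int rq_limit;		/* per-cgroup cap */
};

static int iocg_may_queue(struct request_queue *q, int rw)
{
	struct io_cgroup *iocg = iocg_from_task(current);

	if (iocg && iocg->nr_requests >= iocg->rq_limit)
		return ELV_MQUEUE_NO;	/* cgroup has used up its share */

	return ELV_MQUEUE_MAY;
}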
--
Jens Axboe
On Tue, Nov 18 2008, Fabio Checconi wrote:
> > From: Jens Axboe <[email protected]>
> > Date: Tue, Nov 18, 2008 08:12:08PM +0100
> >
> > On Tue, Nov 18 2008, Fabio Checconi wrote:
> > > BFQ started from CFQ, extending it in the way you correctly describe,
> > > so it is indeed very similar. There are also some minor changes to
> > > locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
> > >
> ...
> > My preferred approach here would be, in order or TODO:
> >
> > - Create and test the smallish patches for seekiness, hw_tag checking,
> > and so on for CFQ.
> > - Create and test a WF2Q+ service dispatching patch for CFQ.
> >
> > and if there are leftovers after that, we could even conditionally
> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
> > could be easily usable as the default, so it's definitely a viable
> > alternative.
> >
> > My main goal here is basically avoiding addition of Yet Another IO
> > scheduler, especially one that is so closely tied to CFQ already.
> >
> > I'll start things off by splitting cfq into a few files similar to what
> > bfq has done, as I think it makes a lot of sense. Fabio, if you could
> > create patches for the small behavioural changes you made, we can
> > discuss and hopefully merge those next.
> >
>
> Ok, I can do that, I need just a little bit of time to organize
> the work.
>
> About these small (some of them are really small) changes, a mixed list
> of things that they will touch and/or things that I'd like to have clear
> before starting to write the patches (maybe we can start another thread
> for them):
>
> - In cfq_exit_single_io_context() and in changed_ioprio(), cic->key
> is dereferenced without holding any lock. As I reported in [1]
> this seems to be a problem when an exit() races with a cfq_exit_queue()
> and in a few other cases. In BFQ we used a somehow involved
> mechanism to avoid that, abusing rcu (of course we'll have to wait
> the patch to talk about it :) ), but given my lack of understanding
> of some parts of the block layer, I'd be interested in knowing if
> the race is possible and/or if there is something more involved
> going on that can cause the same effects.
OK, I'm assuming this is where Nikanth got his idea for the patch from?
It does seem racy in spots, we can definitely proceed on getting that
tightened up some more.
> - set_task_ioprio() in fs/ioprio.c doesn't seem to have a write
> memory barrier to pair with the dependent read one in
> cfq_get_io_context().
Agree, that needs fixing.
> - CFQ_MIN_TT is 2ms, this can result, depending on the value of
> HZ in timeouts of one jiffy, that may expire too early, so we are
> just wasting time and do not actually wait for the task to present
> its new request. Dealing with seeky traffic we've seen a lot of
> early timeouts due to one jiffy timers expiring too early, is
> it worth fixing or can we live with that?
We probably just need to enforce a '2 jiffies minimum' rule for that.
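Something along these lines, just as an illustration (not an actual patch):

/* Illustrative only: derive a minimum think-time timeout that cannot
 * round down to a single jiffy on low-HZ kernels. */
static inline unsigned long min_think_time_jiffies(void)
{
	return max_t(unsigned long, msecs_to_jiffies(2), 2UL);
}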
> - To detect hw tagging in BFQ we consider a sample valid iff the
> number of requests that the scheduler could have dispatched (given
> by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
> the scheduler plus the ones into the driver) is higher than the
> CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems
> during testing, but the way CFQ uses now seems a little bit
> strange.
Not sure this matters a whole lot, but your approach makes sense. Have
you seen the later change to the CFQ logic from Aaron?
> - Initially, cic->last_request_pos is zero, so the sdist charged
> to a task for its first seek depends on the position on the disk
> that is accessed first, independently from its seekiness. Even
> if there is a cap on that value, we choose to not charge the first
> seek to processes; that resulted in less wrong predictions for
> purely sequential loads.
Agreed, that is definitely off.
> - From my understanding, with shared I/O contexts, two different
> tasks may concurrently lookup for a cfqd into the same ioc.
> This may result in cfq_drop_dead_cic() being called two times
> for the same cic. Am I missing something that prevents that from
> happening?
That also looks problematic. I guess we need to recheck that under the
lock when in cfq_drop_dead_cic().
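Something like this, perhaps (untested sketch):

  /* cfq_drop_dead_cic(), sketch */
  unsigned long flags;

  spin_lock_irqsave(&ioc->lock, flags);

  /*
   * With shared io contexts two tasks may race to drop the same
   * dead cic, so only the one that still finds it in the radix
   * tree gets to free it.
   */
  if (radix_tree_lookup(&ioc->radix_root, (unsigned long) cfqd) == cic) {
          radix_tree_delete(&ioc->radix_root, (unsigned long) cfqd);
          hlist_del_rcu(&cic->cic_list);
          cfq_cic_free(cic);
  }

  spin_unlock_irqrestore(&ioc->lock, flags);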
> Regarding the code splitup, do you think you'll go for the CFS(BFQ) way,
> using a single compilation unit and including the .c files, or a layout
> with different compilation units (like the ll_rw_blk.c splitup)?
Different compilation units would be my preferred choice.
--
Jens Axboe
> From: Jens Axboe <[email protected]>
> Date: Wed, Nov 19, 2008 03:30:07PM +0100
>
> On Tue, Nov 18 2008, Fabio Checconi wrote:
...
> > - In cfq_exit_single_io_context() and in changed_ioprio(), cic->key
> > is dereferenced without holding any lock. As I reported in [1]
> > this seems to be a problem when an exit() races with a cfq_exit_queue()
> > and in a few other cases. In BFQ we used a somehow involved
> > mechanism to avoid that, abusing rcu (of course we'll have to wait
> > the patch to talk about it :) ), but given my lack of understanding
> > of some parts of the block layer, I'd be interested in knowing if
> > the race is possible and/or if there is something more involved
> > going on that can cause the same effects.
>
> OK, I'm assuming this is where Nikanth got his idea for the patch from?
I think so.
> It does seem racy in spots, we can definitely proceed on getting that
> tightened up some more.
>
> > - set_task_ioprio() in fs/ioprio.c doesn't seem to have a write
> > memory barrier to pair with the dependent read one in
> > cfq_get_io_context().
>
> Agree, that needs fixing.
>
> > - CFQ_MIN_TT is 2ms, this can result, depending on the value of
> > HZ in timeouts of one jiffy, that may expire too early, so we are
> > just wasting time and do not actually wait for the task to present
> > its new request. Dealing with seeky traffic we've seen a lot of
> > early timeouts due to one jiffy timers expiring too early, is
> > it worth fixing or can we live with that?
>
> We probably just need to enforce a '2 jiffies minimum' rule for that.
>
> > - To detect hw tagging in BFQ we consider a sample valid iff the
> > number of requests that the scheduler could have dispatched (given
> > by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
> > the scheduler plus the ones into the driver) is higher than the
> > CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems
> > during testing, but the way CFQ uses now seems a little bit
> > strange.
>
> Not sure this matters a whole lot, but your approach makes sense. Have
> you seen the later change to the CFQ logic from Aaron?
>
Yes, we started from his code. As Aaron reported, on BFQ our change
to the CIC_SEEKY logic has a bad interaction with the hw tag detection
on some workloads, but that problem should be easy to solve (test patch
posted in http://lkml.org/lkml/2008/11/19/100).
> > - Initially, cic->last_request_pos is zero, so the sdist charged
> > to a task for its first seek depends on the position on the disk
> > that is accessed first, independently from its seekiness. Even
> > if there is a cap on that value, we choose to not charge the first
> > seek to processes; that resulted in less wrong predictions for
> > purely sequential loads.
>
> Agreed, that is definitely off.
>
> > - From my understanding, with shared I/O contexts, two different
> > tasks may concurrently lookup for a cfqd into the same ioc.
> > This may result in cfq_drop_dead_cic() being called two times
> > for the same cic. Am I missing something that prevents that from
> > happening?
>
> That also looks problematic. I guess we need to recheck that under the
> lock when in cfq_drop_dead_cic().
>
> > Regarding the code splitup, do you think you'll go for the CFS(BFQ) way,
> > using a single compilation unit and including the .c files, or a layout
> > with different compilation units (like the ll_rw_blk.c splitup)?
>
> Different compilation units would be my preferred choice.
>
Ok, thank you, I'll try to put together and test some patches, and to
post them for discussion in the next few days.
On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <[email protected]> wrote:
> On Tue, Nov 18 2008, Nauman Rafique wrote:
>> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <[email protected]> wrote:
>> > On Tue, Nov 18 2008, Fabio Checconi wrote:
>> >> > From: Vivek Goyal <[email protected]>
>> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
>> >> >
>> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
>> >> ...
>> >> > > I have to think a little bit on how it would be possible to support
>> >> > > an option for time-only budgets, coexisting with the current behavior,
>> >> > > but I think it can be done.
>> >> > >
>> >> >
>> >> > IIUC, bfq and cfq are different in following manner.
>> >> >
>> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
>> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
>> >> > time slices.
>> >> > c. BFQ supports hierarchical fair queuing and CFQ does not.
>> >> >
>> >> > We are looking forward for implementation of point C. Fabio seems to
>> >> > thinking of supporting time slice as a service (B). It seems like
>> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
>> >> > robin).
>> >> >
>> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
>> >> > that they have been able to ensure throughput while ensuring tighter
>> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
>> >> > down the line?
>> >> >
>> >>
>> >> BFQ started from CFQ, extending it in the way you correctly describe,
>> >> so it is indeed very similar. There are also some minor changes to
>> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
>> >>
>> >> The two schedulers share similar goals, and in my opinion BFQ can be
>> >> considered, in the long term, a CFQ replacement; *but* before talking
>> >> about replacing CFQ we have to consider that:
>> >>
>> >> - it *needs* review and testing; we've done our best, but for sure
>> >> it's not enough; review and testing are never enough;
>> >> - the service domain fairness, which was one of our objectives, requires
>> >> some extra complexity; the mechanisms we used and the design choices
>> >> we've made may not fit all the needs, or may not be as generic as the
>> >> simpler CFQ's ones;
>> >> - CFQ has years of history behind and has been tuned for a wider
>> >> variety of environments than the ones we've been able to test.
>> >>
>> >> If time-based fairness is considered more robust and the loss of
>> >> service-domain fairness is not a problem, then the two schedulers can
>> >> be made even more similar.
>> >
>> > My preferred approach here would be, in order or TODO:
>> >
>> > - Create and test the smallish patches for seekiness, hw_tag checking,
>> > and so on for CFQ.
>> > - Create and test a WF2Q+ service dispatching patch for CFQ.
>> >
>> > and if there are leftovers after that, we could even conditionally
>> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
>> > could be easily usable as the default, so it's definitely a viable
>> > alternative.
>>
>> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would
>> result in time slices being scheduled using WF2Q+
>
> Yep, at least that is my preference.
>
>> 2 Do the following to support proportional division:
>> a) Expose the per device weight interface to user, instead of calculating
>> from priority.
>> b) Add support for scheduling bandwidth among a hierarchy of cgroups
>> (besides threads)
>> 3 Do the following to support the goals of 2 level schedulers:
>> a) Limit the request descriptors allocated to each cgroup by adding
>> functionality to elv_may_queue()
>> b) Add support for putting an absolute limit on IO consumed by a
>> cgroup. Such support is provided by Andrea
>> Righi's patches too.
>> c) Add support (configurable option) to keep track of total disk
>> time/sectors/count
>> consumed at each device, and factor that into scheduling decision
>> (more discussion needed here)
>> 6 Incorporate an IO tracking approach which can allow tracking cgroups
>> for asynchronous reads/writes.
>> 7 Start an offline email thread to keep track of progress on the above
>> goals.
>>
>> Jens, what is your opinion everything beyond (1) in the above list?
>>
>> It would be great if work on (1) and (2)-(7) can happen in parallel so
>> that we can see "proportional division of IO bandwidth to cgroups" in
>> tree sooner than later.
>
> Sounds feasible, I'd like to see the cgroups approach get more traction.
> My primary concern is just that I don't want to merge it into specific
> IO schedulers.
Jens,
So are you saying you don't prefer cgroups based proportional IO
division solutions in the IO scheduler but at a layer above so it can
be shared with all IO schedulers?
If yes, then what do you think about Vivek Goyal's patch or dm-ioband,
which achieve that? Of course, neither solution meets all the
requirements in the list above, but we can work on that
once we know which direction we should be heading in. In fact, it
would help if you could express the reservations (if you have any)
about these approaches. That would help in coming up with a plan that
everyone agrees on.
Thanks,
Divyesh
> As you mention, we can hook into the may queue logic for
> that subset of the problem, that avoids touching the io scheduler. If we
> can get this supported 'generically', then I'd be happy to help out.
>
> --
> Jens Axboe
>
>
Fabio Checconi wrote:
>>> Fabio Checconi wrote:
>>>> - To detect hw tagging in BFQ we consider a sample valid iff the
>>>> number of requests that the scheduler could have dispatched (given
>>>> by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
>>>> the scheduler plus the ones into the driver) is higher than the
>>>> CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems
>>>> during testing, but the way CFQ uses now seems a little bit
>>>> strange.
>>> BFQ's tag detection logic is broken in the same way that CFQ's used to
>>> be. Explanation is in this patch:
>>>
>> If you look at bfq_update_hw_tag(), the logic introduced by the patch
>> you mention is still there; BFQ starts with ->hw_tag = 1, and updates it
Yes, I missed that. So which part of CFQ's hw_tag detection is strange?
>> every 32 valid samples. What changed WRT your patch, apart from the
>> number of samples, is that the condition for a sample to be valid is:
>>
>> bfqd->rq_in_driver + bfqd->queued >= 5
>>
>> while in your patch it is:
>>
>> cfqd->rq_queued > 5 || cfqd->rq_in_driver > 5
>>
>> We preferred the first one because that sum better reflects the number
>> of requests that could have been dispatched, and I don't think that this
>> is wrong.
I think it's fine too. CFQ's condition accounts for a few rare situations,
such as the device stalling or hw_tag being updated right after a bunch of
requests are queued. They are probably irrelevant, but can't hurt.
>> There is a problem, but it's not within the tag detection logic itself.
>> From some quick experiments, what happens is that when a process starts,
>> CFQ considers it seeky (*), BFQ doesn't. As a side effect BFQ does not
>> always dispatch enough requests to correctly detect tagging.
>>
>> At the first seek you cannot tell if the process is going to bee seeky
>> or not, and we have chosen to consider it sequential because it improved
>> fairness in some sequential workloads (the CIC_SEEKY heuristic is used
>> also to determine the idle_window length in [bc]fq_arm_slice_timer()).
>>
>> Anyway, we're dealing with heuristics, and they tend to favor some
>> workload over other ones. If recovering this thoughput loss is more
>> important than a transient unfairness due to short idling windows assigned
>> to sequential processes when they start, I've no problems in switching
>> the CIC_SEEKY logic to consider a process seeky when it starts.
>>
>> Thank you for testing and for pointing out this issue, we missed it
>> in our testing.
>>
>>
>> (*) to be correct, the initial classification depends on the position
>> of the first accessed sector.
>
> Sorry, I forgot the patch... This seems to solve the problem with
> your workload here, does it work for you?
Yes, it works fine now :)
However, hw_tag detection (in CFQ and BFQ) is still broken in a few ways:
* If you go from queue_depth=1 to queue_depth=large, it's possible that
  the detection logic fails. This could happen when queue_depth is set
  to a larger value at boot, which seems a reasonable situation.
* It depends too much on the hardware. If you have a seeky load on a
  fast disk with a unit queue depth, idling sucks for performance (I
  imagine this is particularly bad on SSDs). If you have any disk with
  a deep queue, not idling sucks for fairness.
I suppose CFQ's slice_resid is supposed to help here, but as far as I can
tell, it doesn't do a thing.
-- Aaron
> From: Aaron Carroll <[email protected]>
> Date: Thu, Nov 20, 2008 03:45:02PM +1100
>
> Fabio Checconi wrote:
> >>> Fabio Checconi wrote:
> >>>> - To detect hw tagging in BFQ we consider a sample valid iff the
> >>>> number of requests that the scheduler could have dispatched (given
> >>>> by cfqd->rb_queued + cfqd->rq_in_driver, i.e., the ones still into
> >>>> the scheduler plus the ones into the driver) is higher than the
> >>>> CFQ_HW_QUEUE_MIN threshold. This obviously caused no problems
> >>>> during testing, but the way CFQ uses now seems a little bit
> >>>> strange.
> >>> BFQ's tag detection logic is broken in the same way that CFQ's used to
> >>> be. Explanation is in this patch:
> >>>
> >> If you look at bfq_update_hw_tag(), the logic introduced by the patch
> >> you mention is still there; BFQ starts with ->hw_tag = 1, and updates it
>
> Yes, I missed that. So which part of CFQ's hw_tag detection is strange?
>
I just think that it is rather counterintuitive to consider a sample
invalid when you have, say, rq_in_driver = 1, 2, 3 or 4 and another
4 queued requests. Considering the actual number of requests that
could have been dispatched seemed more straightforward than
considering the two values separately.
Anyway I think the validity of the samples is a minor issue, while the
throughput loss you experienced was a more serious one.
> >> every 32 valid samples. What changed WRT your patch, apart from the
> >> number of samples, is that the condition for a sample to be valid is:
> >>
> >> bfqd->rq_in_driver + bfqd->queued >= 5
> >>
> >> while in your patch it is:
> >>
> >> cfqd->rq_queued > 5 || cfqd->rq_in_driver > 5
> >>
> >> We preferred the first one because that sum better reflects the number
> >> of requests that could have been dispatched, and I don't think that this
> >> is wrong.
>
> I think it's fine too. CFQ's condition accounts for a few rare situations,
> such as the device stalling or hw_tag being updated right after a bunch of
> requests are queued. They are probably irrelevant, but can't hurt.
>
> >> There is a problem, but it's not within the tag detection logic itself.
> >> From some quick experiments, what happens is that when a process starts,
> >> CFQ considers it seeky (*), BFQ doesn't. As a side effect BFQ does not
> >> always dispatch enough requests to correctly detect tagging.
> >>
> >> At the first seek you cannot tell if the process is going to bee seeky
> >> or not, and we have chosen to consider it sequential because it improved
> >> fairness in some sequential workloads (the CIC_SEEKY heuristic is used
> >> also to determine the idle_window length in [bc]fq_arm_slice_timer()).
> >>
> >> Anyway, we're dealing with heuristics, and they tend to favor some
> >> workload over other ones. If recovering this thoughput loss is more
> >> important than a transient unfairness due to short idling windows assigned
> >> to sequential processes when they start, I've no problems in switching
> >> the CIC_SEEKY logic to consider a process seeky when it starts.
> >>
> >> Thank you for testing and for pointing out this issue, we missed it
> >> in our testing.
> >>
> >>
> >> (*) to be correct, the initial classification depends on the position
> >> of the first accessed sector.
> >
> > Sorry, I forgot the patch... This seems to solve the problem with
> > your workload here, does it work for you?
>
> Yes, it works fine now :)
>
Thank you very much for trying it.
> However, hw_tag detection (in CFQ and BFQ) is still broken in a few ways:
> * If you go from queue_depth=1 to queue_depth=large, it's possible that
> the detection logic fails. This could happen if setting queue_depth
> to a larger value at boot, which seems a reasonable situation.
I think that the transition of hw_tag from 1 to 0 can happen quite easily,
and may depend only on the workload, while getting back to 1 is more
difficult, because when hw_tag is 0 there may be too few dispatches to
detect queueing...
> * It depends too much on the hardware. If you have a seekly load on a
> fast disk with a unit queue depth, idling sucks for performance (I
> imagine this is particularly bad on SSDs). If you have any disk with
> a deep queue, not idling sucks for fairness.
Agreed. This fairness vs. throughput conflict is very workload
dependent too.
> I suppose CFQ's slice_resid is supposed to help here, but as far as I can
> tell, it doesn't do a thing.
>
On Wed, Nov 19 2008, Divyesh Shah wrote:
> On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <[email protected]> wrote:
> > On Tue, Nov 18 2008, Nauman Rafique wrote:
> >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <[email protected]> wrote:
> >> > On Tue, Nov 18 2008, Fabio Checconi wrote:
> >> >> > From: Vivek Goyal <[email protected]>
> >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> >> >> >
> >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> >> >> ...
> >> >> > > I have to think a little bit on how it would be possible to support
> >> >> > > an option for time-only budgets, coexisting with the current behavior,
> >> >> > > but I think it can be done.
> >> >> > >
> >> >> >
> >> >> > IIUC, bfq and cfq are different in following manner.
> >> >> >
> >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
> >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> >> >> > time slices.
> >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not.
> >> >> >
> >> >> > We are looking forward for implementation of point C. Fabio seems to
> >> >> > thinking of supporting time slice as a service (B). It seems like
> >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> >> >> > robin).
> >> >> >
> >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> >> >> > that they have been able to ensure throughput while ensuring tighter
> >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> >> >> > down the line?
> >> >> >
> >> >>
> >> >> BFQ started from CFQ, extending it in the way you correctly describe,
> >> >> so it is indeed very similar. There are also some minor changes to
> >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
> >> >>
> >> >> The two schedulers share similar goals, and in my opinion BFQ can be
> >> >> considered, in the long term, a CFQ replacement; *but* before talking
> >> >> about replacing CFQ we have to consider that:
> >> >>
> >> >> - it *needs* review and testing; we've done our best, but for sure
> >> >> it's not enough; review and testing are never enough;
> >> >> - the service domain fairness, which was one of our objectives, requires
> >> >> some extra complexity; the mechanisms we used and the design choices
> >> >> we've made may not fit all the needs, or may not be as generic as the
> >> >> simpler CFQ's ones;
> >> >> - CFQ has years of history behind and has been tuned for a wider
> >> >> variety of environments than the ones we've been able to test.
> >> >>
> >> >> If time-based fairness is considered more robust and the loss of
> >> >> service-domain fairness is not a problem, then the two schedulers can
> >> >> be made even more similar.
> >> >
> >> > My preferred approach here would be, in order or TODO:
> >> >
> >> > - Create and test the smallish patches for seekiness, hw_tag checking,
> >> > and so on for CFQ.
> >> > - Create and test a WF2Q+ service dispatching patch for CFQ.
> >> >
> >> > and if there are leftovers after that, we could even conditionally
> >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
> >> > could be easily usable as the default, so it's definitely a viable
> >> > alternative.
> >>
> >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would
> >> result in time slices being scheduled using WF2Q+
> >
> > Yep, at least that is my preference.
> >
> >> 2 Do the following to support proportional division:
> >> a) Expose the per device weight interface to user, instead of calculating
> >> from priority.
> >> b) Add support for scheduling bandwidth among a hierarchy of cgroups
> >> (besides threads)
> >> 3 Do the following to support the goals of 2 level schedulers:
> >> a) Limit the request descriptors allocated to each cgroup by adding
> >> functionality to elv_may_queue()
> >> b) Add support for putting an absolute limit on IO consumed by a
> >> cgroup. Such support is provided by Andrea
> >> Righi's patches too.
> >> c) Add support (configurable option) to keep track of total disk
> >> time/sectors/count
> >> consumed at each device, and factor that into scheduling decision
> >> (more discussion needed here)
> >> 6 Incorporate an IO tracking approach which can allow tracking cgroups
> >> for asynchronous reads/writes.
> >> 7 Start an offline email thread to keep track of progress on the above
> >> goals.
> >>
> >> Jens, what is your opinion everything beyond (1) in the above list?
> >>
> >> It would be great if work on (1) and (2)-(7) can happen in parallel so
> >> that we can see "proportional division of IO bandwidth to cgroups" in
> >> tree sooner than later.
> >
> > Sounds feasible, I'd like to see the cgroups approach get more traction.
> > My primary concern is just that I don't want to merge it into specific
> > IO schedulers.
>
> Jens,
> So are you saying you don't prefer cgroups based proportional IO
> division solutions in the IO scheduler but at a layer above so it can
> be shared with all IO schedulers?
>
> If yes, then in that case, what do you think about Vivek Goyal's
> patch or dm-ioband that achieve that. Of course, both solutions don't
> meet all the requirements in the list above, but we can work on that
> once we know which direction we should be heading in. In fact, it
> would help if you could express the reservations (if you have any)
> about these approaches. That would help in coming up with a plan that
> everyone agrees on.
The dm approach has some merits, the major one being that it'll fit
directly into existing setups that use dm and can be controlled with
familiar tools. That is a bonus. The drawback is partially the same -
it'll require dm. So it's still not a fit-all approach, unfortunately.
So I'd prefer an approach that doesn't force you to use dm.
--
Jens Axboe
Hi Vivek,
Sorry for the late reply.
> > > Do you have any benchmark results?
> > > I'm especially interested in the followings:
> > > - Comparison of disk performance with and without the I/O controller patch.
> >
> > If I dynamically disable the bio control, then I did not observe any
> > impact on performance. Because in that case practically it boils down
> > to just an additional variable check in __make_request().
> >
>
> Oh.., I understood your question wrong. You are looking for what's the
> performance penalty if I enable the IO controller on a device.
Yes, that is what I want to know.
> I have not done any extensive benchmarking. If I run two dd commands
> without controller, I get 80MB/s from disk (roughly 40 MB for each task).
> With bio group enabled (default token=2000), I was getting total BW of
> roughly 68 MB/s.
>
> I have not done any performance analysis or optimizations at this point of
> time. I plan to do that once we have some sort of common understanding about
> a particular approach. There are so many IO controllers floating, right now
> I am more concerned if we can all come to a common platform.
I understand well the reason for posting the patch.
> Ryo, do you still want to stick to two level scheduling? Given the problem
> of it breaking down underlying scheduler's assumptions, probably it makes
> more sense to the IO control at each individual IO scheduler.
I don't want to stick to it. I'm considering experimentally implementing
dm-ioband's algorithm in the block I/O layer.
Thanks,
Ryo Tsuruta
On Thu, Nov 20, 2008 at 09:16:41AM +0100, Jens Axboe wrote:
> On Wed, Nov 19 2008, Divyesh Shah wrote:
> > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <[email protected]> wrote:
> > > On Tue, Nov 18 2008, Nauman Rafique wrote:
> > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <[email protected]> wrote:
> > >> > On Tue, Nov 18 2008, Fabio Checconi wrote:
> > >> >> > From: Vivek Goyal <[email protected]>
> > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> > >> >> >
> > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> > >> >> ...
> > >> >> > > I have to think a little bit on how it would be possible to support
> > >> >> > > an option for time-only budgets, coexisting with the current behavior,
> > >> >> > > but I think it can be done.
> > >> >> > >
> > >> >> >
> > >> >> > IIUC, bfq and cfq are different in following manner.
> > >> >> >
> > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
> > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> > >> >> > time slices.
> > >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not.
> > >> >> >
> > >> >> > We are looking forward for implementation of point C. Fabio seems to
> > >> >> > thinking of supporting time slice as a service (B). It seems like
> > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> > >> >> > robin).
> > >> >> >
> > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> > >> >> > that they have been able to ensure throughput while ensuring tighter
> > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> > >> >> > down the line?
> > >> >> >
> > >> >>
> > >> >> BFQ started from CFQ, extending it in the way you correctly describe,
> > >> >> so it is indeed very similar. There are also some minor changes to
> > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
> > >> >>
> > >> >> The two schedulers share similar goals, and in my opinion BFQ can be
> > >> >> considered, in the long term, a CFQ replacement; *but* before talking
> > >> >> about replacing CFQ we have to consider that:
> > >> >>
> > >> >> - it *needs* review and testing; we've done our best, but for sure
> > >> >> it's not enough; review and testing are never enough;
> > >> >> - the service domain fairness, which was one of our objectives, requires
> > >> >> some extra complexity; the mechanisms we used and the design choices
> > >> >> we've made may not fit all the needs, or may not be as generic as the
> > >> >> simpler CFQ's ones;
> > >> >> - CFQ has years of history behind and has been tuned for a wider
> > >> >> variety of environments than the ones we've been able to test.
> > >> >>
> > >> >> If time-based fairness is considered more robust and the loss of
> > >> >> service-domain fairness is not a problem, then the two schedulers can
> > >> >> be made even more similar.
> > >> >
> > >> > My preferred approach here would be, in order or TODO:
> > >> >
> > >> > - Create and test the smallish patches for seekiness, hw_tag checking,
> > >> > and so on for CFQ.
> > >> > - Create and test a WF2Q+ service dispatching patch for CFQ.
> > >> >
> > >> > and if there are leftovers after that, we could even conditionally
> > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
> > >> > could be easily usable as the default, so it's definitely a viable
> > >> > alternative.
> > >>
> > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would
> > >> result in time slices being scheduled using WF2Q+
> > >
> > > Yep, at least that is my preference.
> > >
> > >> 2 Do the following to support proportional division:
> > >> a) Expose the per device weight interface to user, instead of calculating
> > >> from priority.
> > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups
> > >> (besides threads)
> > >> 3 Do the following to support the goals of 2 level schedulers:
> > >> a) Limit the request descriptors allocated to each cgroup by adding
> > >> functionality to elv_may_queue()
> > >> b) Add support for putting an absolute limit on IO consumed by a
> > >> cgroup. Such support is provided by Andrea
> > >> Righi's patches too.
> > >> c) Add support (configurable option) to keep track of total disk
> > >> time/sectors/count
> > >> consumed at each device, and factor that into scheduling decision
> > >> (more discussion needed here)
> > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups
> > >> for asynchronous reads/writes.
> > >> 7 Start an offline email thread to keep track of progress on the above
> > >> goals.
> > >>
> > >> Jens, what is your opinion everything beyond (1) in the above list?
> > >>
> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so
> > >> that we can see "proportional division of IO bandwidth to cgroups" in
> > >> tree sooner than later.
> > >
> > > Sounds feasible, I'd like to see the cgroups approach get more traction.
> > > My primary concern is just that I don't want to merge it into specific
> > > IO schedulers.
> >
> > Jens,
> > So are you saying you don't prefer cgroups based proportional IO
> > division solutions in the IO scheduler but at a layer above so it can
> > be shared with all IO schedulers?
> >
> > If yes, then in that case, what do you think about Vivek Goyal's
> > patch or dm-ioband that achieve that. Of course, both solutions don't
> > meet all the requirements in the list above, but we can work on that
> > once we know which direction we should be heading in. In fact, it
> > would help if you could express the reservations (if you have any)
> > about these approaches. That would help in coming up with a plan that
> > everyone agrees on.
>
> The dm approach has some merrits, the major one being that it'll fit
> directly into existing setups that use dm and can be controlled with
> familiar tools. That is a bonus. The draw back is partially the same -
> it'll require dm. So it's still not a fit-all approach, unfortunately.
>
> So I'd prefer an approach that doesn't force you to use dm.
Hi Jens,
My patches meet the goal of not requiring a dm device for every device
one wants to control.
Having said that, a few things come to mind.
- In what cases do we need to control higher level logical devices like
dm? It looks like the real contention for resources is at the leaf nodes,
hence any kind of resource management/fair queueing should probably be
done at the leaf nodes and not at higher level logical nodes.
If that makes sense, then we probably don't need to control the dm device
and we don't need such higher level solutions.
- Any kind of 2-level scheduler solution has the potential to break the
underlying IO scheduler. A higher level solution requires buffering of
bios and controlled release of bios to the lower layers. This control
breaks the assumptions of the lower layer IO scheduler, which knows in
what order bios should be dispatched to the device to meet the semantics
it exports.
- A 2nd level scheduler does not keep track of tasks but of task groups,
and lets every group dispatch its fair share. This has a small semantic
problem: tasks and groups in the root cgroup will not be considered at
the same level. "root" will be treated as one group at the same level as
all the child groups, and hence competes with them for resources.
This looks a little odd. Considering tasks and groups at the same level
makes more sense. The cpu scheduler also considers tasks and groups at
the same level, and deviating from that is probably not a good idea.
Considering tasks and groups at the same level matters only if the IO
scheduler maintains a separate queue per task, like CFQ does, because in
that case the IO scheduler tries to provide fairness among the various
task queues. Some schedulers like noop have no notion of separate task
queues or fairness among them. In that case we probably have no choice
but to let the root group compete with the child groups.
Keeping the above points in mind, two-level scheduling is probably not a
very good idea. If putting the code in a particular IO scheduler is a
concern, we can explore ways to maximize the sharing of cgroup code
among IO schedulers.
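For example, the cgroup side of it could be a single scheduler-agnostic
subsystem that only exports a per-group weight, which every IO scheduler
then queries. A very rough sketch (all names here are illustrative only,
and the weight file / destroy path are omitted):

  struct io_cgroup {
          struct cgroup_subsys_state css;
          unsigned int weight;
  };

  static struct cgroup_subsys_state *
  iocg_create(struct cgroup_subsys *ss, struct cgroup *cgrp)
  {
          struct io_cgroup *iocg;

          iocg = kzalloc(sizeof(*iocg), GFP_KERNEL);
          if (!iocg)
                  return ERR_PTR(-ENOMEM);
          iocg->weight = 100;     /* some sane default */
          return &iocg->css;
  }

  struct cgroup_subsys io_subsys = {
          .name   = "io",
          .create = iocg_create,
          /* .destroy, .populate and the weight file omitted here */
  };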
Thanks
Vivek
On Thu, Nov 20, 2008 at 06:20:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Sorry for late reply.
>
> > > > Do you have any benchmark results?
> > > > I'm especially interested in the followings:
> > > > - Comparison of disk performance with and without the I/O controller patch.
> > >
> > > If I dynamically disable the bio control, then I did not observe any
> > > impact on performance. Because in that case practically it boils down
> > > to just an additional variable check in __make_request().
> > >
> >
> > Oh.., I understood your question wrong. You are looking for what's the
> > performance penalty if I enable the IO controller on a device.
>
> Yes, that is what I want to know.
>
> > I have not done any extensive benchmarking. If I run two dd commands
> > without controller, I get 80MB/s from disk (roughly 40 MB for each task).
> > With bio group enabled (default token=2000), I was getting total BW of
> > roughly 68 MB/s.
> >
> > I have not done any performance analysis or optimizations at this point of
> > time. I plan to do that once we have some sort of common understanding about
> > a particular approach. There are so many IO controllers floating, right now
> > I am more concerned if we can all come to a common platform.
>
> I understood the reason of posting the patch well.
>
> > Ryo, do you still want to stick to two level scheduling? Given the problem
> > of it breaking down underlying scheduler's assumptions, probably it makes
> > more sense to the IO control at each individual IO scheduler.
>
> I don't want to stick to it. I'm considering implementing dm-ioband's
> algorithm into the block I/O layer experimentally.
Thanks Ryo. Implementing the control at the block layer sounds like
another 2-level scheduling scheme. We will still have the issue of
breaking the underlying CFQ and other schedulers. How do you plan to
resolve that conflict?
What do you think about a solution at the IO scheduler level (like BFQ),
or maybe a little above that, where one can try some code sharing among
IO schedulers?
Thanks
Vivek
On Thu, Nov 20, 2008 at 5:40 AM, Vivek Goyal <[email protected]> wrote:
> On Thu, Nov 20, 2008 at 09:16:41AM +0100, Jens Axboe wrote:
>> On Wed, Nov 19 2008, Divyesh Shah wrote:
>> > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <[email protected]> wrote:
>> > > On Tue, Nov 18 2008, Nauman Rafique wrote:
>> > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <[email protected]> wrote:
>> > >> > On Tue, Nov 18 2008, Fabio Checconi wrote:
>> > >> >> > From: Vivek Goyal <[email protected]>
>> > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
>> > >> >> >
>> > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
>> > >> >> ...
>> > >> >> > > I have to think a little bit on how it would be possible to support
>> > >> >> > > an option for time-only budgets, coexisting with the current behavior,
>> > >> >> > > but I think it can be done.
>> > >> >> > >
>> > >> >> >
>> > >> >> > IIUC, bfq and cfq are different in following manner.
>> > >> >> >
>> > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
>> > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
>> > >> >> > time slices.
>> > >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not.
>> > >> >> >
>> > >> >> > We are looking forward for implementation of point C. Fabio seems to
>> > >> >> > thinking of supporting time slice as a service (B). It seems like
>> > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
>> > >> >> > robin).
>> > >> >> >
>> > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
>> > >> >> > that they have been able to ensure throughput while ensuring tighter
>> > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
>> > >> >> > down the line?
>> > >> >> >
>> > >> >>
>> > >> >> BFQ started from CFQ, extending it in the way you correctly describe,
>> > >> >> so it is indeed very similar. There are also some minor changes to
>> > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
>> > >> >>
>> > >> >> The two schedulers share similar goals, and in my opinion BFQ can be
>> > >> >> considered, in the long term, a CFQ replacement; *but* before talking
>> > >> >> about replacing CFQ we have to consider that:
>> > >> >>
>> > >> >> - it *needs* review and testing; we've done our best, but for sure
>> > >> >> it's not enough; review and testing are never enough;
>> > >> >> - the service domain fairness, which was one of our objectives, requires
>> > >> >> some extra complexity; the mechanisms we used and the design choices
>> > >> >> we've made may not fit all the needs, or may not be as generic as the
>> > >> >> simpler CFQ's ones;
>> > >> >> - CFQ has years of history behind and has been tuned for a wider
>> > >> >> variety of environments than the ones we've been able to test.
>> > >> >>
>> > >> >> If time-based fairness is considered more robust and the loss of
>> > >> >> service-domain fairness is not a problem, then the two schedulers can
>> > >> >> be made even more similar.
>> > >> >
>> > >> > My preferred approach here would be, in order or TODO:
>> > >> >
>> > >> > - Create and test the smallish patches for seekiness, hw_tag checking,
>> > >> > and so on for CFQ.
>> > >> > - Create and test a WF2Q+ service dispatching patch for CFQ.
>> > >> >
>> > >> > and if there are leftovers after that, we could even conditionally
>> > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
>> > >> > could be easily usable as the default, so it's definitely a viable
>> > >> > alternative.
>> > >>
>> > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would
>> > >> result in time slices being scheduled using WF2Q+
>> > >
>> > > Yep, at least that is my preference.
>> > >
>> > >> 2 Do the following to support proportional division:
>> > >> a) Expose the per device weight interface to user, instead of calculating
>> > >> from priority.
>> > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups
>> > >> (besides threads)
>> > >> 3 Do the following to support the goals of 2 level schedulers:
>> > >> a) Limit the request descriptors allocated to each cgroup by adding
>> > >> functionality to elv_may_queue()
>> > >> b) Add support for putting an absolute limit on IO consumed by a
>> > >> cgroup. Such support is provided by Andrea
>> > >> Righi's patches too.
>> > >> c) Add support (configurable option) to keep track of total disk
>> > >> time/sectors/count
>> > >> consumed at each device, and factor that into scheduling decision
>> > >> (more discussion needed here)
>> > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups
>> > >> for asynchronous reads/writes.
>> > >> 7 Start an offline email thread to keep track of progress on the above
>> > >> goals.
>> > >>
>> > >> Jens, what is your opinion everything beyond (1) in the above list?
>> > >>
>> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so
>> > >> that we can see "proportional division of IO bandwidth to cgroups" in
>> > >> tree sooner than later.
>> > >
>> > > Sounds feasible, I'd like to see the cgroups approach get more traction.
>> > > My primary concern is just that I don't want to merge it into specific
>> > > IO schedulers.
>> >
>> > Jens,
>> > So are you saying you don't prefer cgroups based proportional IO
>> > division solutions in the IO scheduler but at a layer above so it can
>> > be shared with all IO schedulers?
>> >
>> > If yes, then in that case, what do you think about Vivek Goyal's
>> > patch or dm-ioband that achieve that. Of course, both solutions don't
>> > meet all the requirements in the list above, but we can work on that
>> > once we know which direction we should be heading in. In fact, it
>> > would help if you could express the reservations (if you have any)
>> > about these approaches. That would help in coming up with a plan that
>> > everyone agrees on.
>>
>> The dm approach has some merrits, the major one being that it'll fit
>> directly into existing setups that use dm and can be controlled with
>> familiar tools. That is a bonus. The draw back is partially the same -
>> it'll require dm. So it's still not a fit-all approach, unfortunately.
>>
>> So I'd prefer an approach that doesn't force you to use dm.
>
> Hi Jens,
>
> My patches met the goal of not using the dm for every device one wants
> to control.
>
> Having said that, few things come to mind.
>
> - In what cases do we need to control the higher level logical devices
> like dm. It looks like real contention for resources is at leaf nodes.
> Hence any kind of resource management/fair queueing should probably be
> done at leaf nodes and not at higher level logical nodes.
>
> If that makes sense, then probably we don't need to control dm device
> and we don't need such higher level solutions.
>
>
> - Any kind of 2 level scheduler solution has the potential to break the
> underlying IO scheduler. Higher level solution requires buffering of
> bios and controlled release of bios to lower layers. This control breaks
> the assumptions of lower layer IO scheduler which knows in what order
> bios should be dispatched to device to meet the semantics exported by
> the IO scheduler.
>
> - 2nd level scheduler does not keep track of tasks but task groups lets
> every group dispatch fair share. This has got little semantic problem in
> the sense that tasks and groups in root cgroup will not be considered at
> same level. "root" will be considered one group at same level with all
> child group hence competing with them for resources.
>
> This looks little odd. Considering tasks and groups same level kind of
> makes more sense. cpu scheduler also consideres tasks and groups at same
> level and deviation from that probably is not very good.
>
> Considering tasks and groups at same level will matter only if IO
> scheduler maintains separate queue for the task, like CFQ. Because
> in that case IO scheduler tries to provide fairness among various task
> queues. Some schedulers like noop don't have any notion of separate
> task queues and fairness among them. In that case probably we don't
> have a choice but to assume root group competing with child groups.
>
> Keeping above points in mind, probably two level scheduling is not a
> very good idea. If putting the code in a particular IO scheduler is a
> concern we can probably explore ways regarding how we can maximize the
> sharing of cgroup code among IO schedulers.
>
> Thanks
> Vivek
>
It seems that we have a solution if we can figure out a way to share
cgroup code between different schedulers. I am wondering how other
schedulers (AS, Deadline, No-op) would use cgroups. Will they have
proportional division between requests from different cgroups, and use
their own policy (e.g. deadline scheduling) within a cgroup? What if we
have both threads and cgroups at a particular level? Putting all threads
in a default cgroup seems like a reasonable choice in that case.
Here is a high level design that comes to mind.
Put the proportional division code and state in common code. Each level
of the hierarchy which has more than one cgroup would have some state
maintained in common code. At the leaf level of the hierarchy, we can
have a cgroup specific scheduler (created when a cgroup is created). We
could even choose a different scheduler for each cgroup (noop for one
cgroup and cfq for another).
The scheduler gets a callback (just like it does right now from the
driver) when the common code schedules a time slice (or budget) for
that cgroup. No buffering/queuing is done in the common code, so it
will still be a single level scheduler. When a request arrives, it is
routed to its scheduler's queues based on its cgroup.
The common code proportional scheduler uses the specified weights to
schedule time slices. Please let me know if this makes any sense, and
then we can start talking about lower level details.
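To make the callback idea a bit more concrete, here is the kind of
interface I am imagining between the common proportional code and the
per-cgroup schedulers (all names are made up, nothing like this exists
today):

  struct io_group_sched_ops {
          /* the common code granted this cgroup a slice/budget;
           * the per-cgroup scheduler now owns the dispatch path */
          void (*slice_start)(struct request_queue *q, void *sched_data);

          /* the slice expired or the budget was exhausted */
          void (*slice_end)(struct request_queue *q, void *sched_data);

          /* a new request arrived and was routed to this cgroup */
          void (*add_request)(void *sched_data, struct request *rq);

          /* hand the next request to the driver, or NULL if none */
          struct request *(*dispatch)(void *sched_data);
  };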
On Thu, Nov 20, 2008 at 11:54:14AM -0800, Nauman Rafique wrote:
> On Thu, Nov 20, 2008 at 5:40 AM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Nov 20, 2008 at 09:16:41AM +0100, Jens Axboe wrote:
> >> On Wed, Nov 19 2008, Divyesh Shah wrote:
> >> > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <[email protected]> wrote:
> >> > > On Tue, Nov 18 2008, Nauman Rafique wrote:
> >> > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <[email protected]> wrote:
> >> > >> > On Tue, Nov 18 2008, Fabio Checconi wrote:
> >> > >> >> > From: Vivek Goyal <[email protected]>
> >> > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> >> > >> >> >
> >> > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> >> > >> >> ...
> >> > >> >> > > I have to think a little bit on how it would be possible to support
> >> > >> >> > > an option for time-only budgets, coexisting with the current behavior,
> >> > >> >> > > but I think it can be done.
> >> > >> >> > >
> >> > >> >> >
> >> > >> >> > IIUC, bfq and cfq are different in following manner.
> >> > >> >> >
> >> > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
> >> > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> >> > >> >> > time slices.
> >> > >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not.
> >> > >> >> >
> >> > >> >> > We are looking forward for implementation of point C. Fabio seems to
> >> > >> >> > thinking of supporting time slice as a service (B). It seems like
> >> > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> >> > >> >> > robin).
> >> > >> >> >
> >> > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> >> > >> >> > that they have been able to ensure throughput while ensuring tighter
> >> > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> >> > >> >> > down the line?
> >> > >> >> >
> >> > >> >>
> >> > >> >> BFQ started from CFQ, extending it in the way you correctly describe,
> >> > >> >> so it is indeed very similar. There are also some minor changes to
> >> > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
> >> > >> >>
> >> > >> >> The two schedulers share similar goals, and in my opinion BFQ can be
> >> > >> >> considered, in the long term, a CFQ replacement; *but* before talking
> >> > >> >> about replacing CFQ we have to consider that:
> >> > >> >>
> >> > >> >> - it *needs* review and testing; we've done our best, but for sure
> >> > >> >> it's not enough; review and testing are never enough;
> >> > >> >> - the service domain fairness, which was one of our objectives, requires
> >> > >> >> some extra complexity; the mechanisms we used and the design choices
> >> > >> >> we've made may not fit all the needs, or may not be as generic as the
> >> > >> >> simpler CFQ's ones;
> >> > >> >> - CFQ has years of history behind and has been tuned for a wider
> >> > >> >> variety of environments than the ones we've been able to test.
> >> > >> >>
> >> > >> >> If time-based fairness is considered more robust and the loss of
> >> > >> >> service-domain fairness is not a problem, then the two schedulers can
> >> > >> >> be made even more similar.
> >> > >> >
> >> > >> > My preferred approach here would be, in order or TODO:
> >> > >> >
> >> > >> > - Create and test the smallish patches for seekiness, hw_tag checking,
> >> > >> > and so on for CFQ.
> >> > >> > - Create and test a WF2Q+ service dispatching patch for CFQ.
> >> > >> >
> >> > >> > and if there are leftovers after that, we could even conditionally
> >> > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
> >> > >> > could be easily usable as the default, so it's definitely a viable
> >> > >> > alternative.
> >> > >>
> >> > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would
> >> > >> result in time slices being scheduled using WF2Q+
> >> > >
> >> > > Yep, at least that is my preference.
> >> > >
> >> > >> 2 Do the following to support proportional division:
> >> > >> a) Expose the per device weight interface to user, instead of calculating
> >> > >> from priority.
> >> > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups
> >> > >> (besides threads)
> >> > >> 3 Do the following to support the goals of 2 level schedulers:
> >> > >> a) Limit the request descriptors allocated to each cgroup by adding
> >> > >> functionality to elv_may_queue()
> >> > >> b) Add support for putting an absolute limit on IO consumed by a
> >> > >> cgroup. Such support is provided by Andrea
> >> > >> Righi's patches too.
> >> > >> c) Add support (configurable option) to keep track of total disk
> >> > >> time/sectors/count
> >> > >> consumed at each device, and factor that into scheduling decision
> >> > >> (more discussion needed here)
> >> > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups
> >> > >> for asynchronous reads/writes.
> >> > >> 7 Start an offline email thread to keep track of progress on the above
> >> > >> goals.
> >> > >>
> >> > >> Jens, what is your opinion everything beyond (1) in the above list?
> >> > >>
> >> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so
> >> > >> that we can see "proportional division of IO bandwidth to cgroups" in
> >> > >> tree sooner than later.
> >> > >
> >> > > Sounds feasible, I'd like to see the cgroups approach get more traction.
> >> > > My primary concern is just that I don't want to merge it into specific
> >> > > IO schedulers.
> >> >
> >> > Jens,
> >> > So are you saying you don't prefer cgroups based proportional IO
> >> > division solutions in the IO scheduler but at a layer above so it can
> >> > be shared with all IO schedulers?
> >> >
> >> > If yes, then in that case, what do you think about Vivek Goyal's
> >> > patch or dm-ioband that achieve that. Of course, both solutions don't
> >> > meet all the requirements in the list above, but we can work on that
> >> > once we know which direction we should be heading in. In fact, it
> >> > would help if you could express the reservations (if you have any)
> >> > about these approaches. That would help in coming up with a plan that
> >> > everyone agrees on.
> >>
> >> The dm approach has some merrits, the major one being that it'll fit
> >> directly into existing setups that use dm and can be controlled with
> >> familiar tools. That is a bonus. The draw back is partially the same -
> >> it'll require dm. So it's still not a fit-all approach, unfortunately.
> >>
> >> So I'd prefer an approach that doesn't force you to use dm.
> >
> > Hi Jens,
> >
> > My patches met the goal of not using the dm for every device one wants
> > to control.
> >
> > Having said that, few things come to mind.
> >
> > - In what cases do we need to control the higher level logical devices
> > like dm. It looks like real contention for resources is at leaf nodes.
> > Hence any kind of resource management/fair queueing should probably be
> > done at leaf nodes and not at higher level logical nodes.
> >
> > If that makes sense, then probably we don't need to control dm device
> > and we don't need such higher level solutions.
> >
> >
> > - Any kind of 2 level scheduler solution has the potential to break the
> > underlying IO scheduler. Higher level solution requires buffering of
> > bios and controlled release of bios to lower layers. This control breaks
> > the assumptions of lower layer IO scheduler which knows in what order
> > bios should be dispatched to device to meet the semantics exported by
> > the IO scheduler.
> >
> > - 2nd level scheduler does not keep track of tasks but task groups lets
> > every group dispatch fair share. This has got little semantic problem in
> > the sense that tasks and groups in root cgroup will not be considered at
> > same level. "root" will be considered one group at same level with all
> > child group hence competing with them for resources.
> >
> > This looks little odd. Considering tasks and groups same level kind of
> > makes more sense. cpu scheduler also consideres tasks and groups at same
> > level and deviation from that probably is not very good.
> >
> > Considering tasks and groups at same level will matter only if IO
> > scheduler maintains separate queue for the task, like CFQ. Because
> > in that case IO scheduler tries to provide fairness among various task
> > queues. Some schedulers like noop don't have any notion of separate
> > task queues and fairness among them. In that case probably we don't
> > have a choice but to assume root group competing with child groups.
> >
> > Keeping above points in mind, probably two level scheduling is not a
> > very good idea. If putting the code in a particular IO scheduler is a
> > concern we can probably explore ways regarding how we can maximize the
> > sharing of cgroup code among IO schedulers.
> >
> > Thanks
> > Vivek
> >
>
> It seems that we have a solution if we can figure out a way to share
> cgroup code between different schedulers. I am thinking how other
> schedulers (AS, Deadline, No-op) would use cgroups. Will they have
> proportional division between requests from different cgroups? And use
> their own policy (e.g deadline scheduling) within a cgroup? How about
> if we have both threads and cgroups at a particular level? I think
> putting all threads in a default cgroup seems like a reasonable choice
> in this case.
>
> Here is a high level design that comes to mind.
>
> Put proportional division code and state in common code. Each level of
> the hierarchy which has more than one cgroup would have some state
> maintained in common code. At leaf level of hiearchy, we can have a
> cgroup specific scheduler (created when a cgroup is created). We can
> choose a different scheduler for each cgroup (we can have a no-op for
> one cgroup while cfq for another).
I am not sure I understand the different-scheduler-for-each-cgroup
aspect of it. What's the need? It makes things even more complicated,
I think.
But moving the proportional division code out of a particular scheduler
and making it common makes sense.
Looking at BFQ, I was thinking that we can reuse a large part of the
code. This common code can treat everything as a scheduling entity. The
scheduling entity (SE) will be defined by the underlying scheduler,
depending on how that scheduler manages its queues. So for CFQ, at
each level an SE can be either a task or a group. For schedulers which
don't maintain separate queues for tasks, it will simply be a group at
all levels.
We can probably employ B-WF2Q+ to provide hierarchical fairness between
the scheduling entities of this tree. The common layer will do the
scheduling of entities (without knowing what is contained inside) and the
underlying scheduler will take care of dispatching the requests from the
scheduled entity (a task queue for CFQ, or a group queue for other
schedulers).
The tricky part would be how to abstract it in a clean way. It should lead
to reduced code in CFQ/BFQ because the B-WF2Q+ logic will, for the most
part, live in the common layer.
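Something like the following is what I have in mind (all the names are
made up, not taken from any existing patch; this is only to illustrate
that the common layer would see nothing but opaque scheduling entities):

/*
 * Hypothetical sketch, not from any existing patch: the common fair
 * queuing layer only sees io_entity objects; the IO scheduler decides
 * what each entity actually is (a per-task queue, a group, ...).
 */
struct io_entity {
	unsigned int		weight;		/* relative share            */
	u64			vstart;		/* virtual start time        */
	u64			vfinish;	/* virtual finish time       */
	struct rb_node		rb_node;	/* node in the service tree  */
	struct io_entity	*parent;	/* NULL at the root          */
	void			*sched_data;	/* opaque: task queue, group */
};

struct io_fq_ops {
	/* dispatch one request from the entity picked by the common layer */
	int (*dispatch_entity)(struct request_queue *q,
			       struct io_entity *entity, int force);
	/* notification that the entity's slice/budget has expired */
	void (*entity_expired)(struct request_queue *q,
			       struct io_entity *entity);
};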
>
> The scheduler gets a callback (just like it does right now from
> driver) when the common code schedules a time slice (or budget) from
> that cgroup. No buffering/queuing is done in the common code, so it
> will still be a single level scheduler. When a request arrives, it is
> routed to its scheduler's queues based on its cgroup.
>
> The common code proportional scheduler uses specified weights to
> schedule time slices. Please let me know if it makes any sense. And
> then we can start talking about lower level details.
We can use either time slices or budgets (perhaps configurable), depending
on which gives better results.
Thanks
Vivek
On Tue, Nov 18, 2008 at 03:41:39PM +0100, Fabio Checconi wrote:
> > From: Vivek Goyal <[email protected]>
> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
> >
> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
> ...
> > > I have to think a little bit on how it would be possible to support
> > > an option for time-only budgets, coexisting with the current behavior,
> > > but I think it can be done.
> > >
> >
> > IIUC, bfq and cfq are different in following manner.
> >
> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
> > time slices.
> > c. BFQ supports hierarchical fair queuing and CFQ does not.
> >
> > We are looking forward for implementation of point C. Fabio seems to
> > thinking of supporting time slice as a service (B). It seems like
> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
> > robin).
> >
> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
> > that they have been able to ensure throughput while ensuring tighter
> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
> > down the line?
> >
>
> BFQ started from CFQ, extending it in the way you correctly describe,
> so it is indeed very similar. There are also some minor changes to
> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
>
> The two schedulers share similar goals, and in my opinion BFQ can be
> considered, in the long term, a CFQ replacement; *but* before talking
> about replacing CFQ we have to consider that:
>
> - it *needs* review and testing; we've done our best, but for sure
> it's not enough; review and testing are never enough;
> - the service domain fairness, which was one of our objectives, requires
> some extra complexity; the mechanisms we used and the design choices
> we've made may not fit all the needs, or may not be as generic as the
> simpler CFQ's ones;
> - CFQ has years of history behind and has been tuned for a wider
> variety of environments than the ones we've been able to test.
>
> If time-based fairness is considered more robust and the loss of
> service-domain fairness is not a problem, then the two schedulers can
> be made even more similar.
Hi Fabio,
I thought I would give bfq a try. I get the following when I put my current
shell into a newly created cgroup and then try to do "ls".
Thanks
Vivek
[ 1246.498412] BUG: unable to handle kernel NULL pointer dereference at 000000bc
[ 1246.498674] IP: [<c034210b>] __bfq_cic_change_cgroup+0x148/0x239
[ 1246.498674] *pde = 00000000
[ 1246.498674] Oops: 0002 [#1] SMP
[ 1246.498674] last sysfs file: /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:1/0:0:1:0/block/sdb/queue/scheduler
[ 1246.498674] Modules linked in:
[ 1246.498674]
[ 1246.498674] Pid: 2352, comm: dd Not tainted (2.6.28-rc4-bfq #2)
[ 1246.498674] EIP: 0060:[<c034210b>] EFLAGS: 00200046 CPU: 0
[ 1246.498674] EIP is at __bfq_cic_change_cgroup+0x148/0x239
[ 1246.498674] EAX: df0e50ac EBX: df0e5000 ECX: 00200046 EDX: df32f300
[ 1246.498674] ESI: dece6ee0 EDI: df0e5000 EBP: df37fc14 ESP: df37fbdc
[ 1246.498674] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 1246.498674] Process dd (pid: 2352, ti=df37e000 task=dfb01e00 task.ti=df37e000)
[ 1246.498674] Stack:
[ 1246.498674] decc9780 dfa98948 df32f300 00000000 00000000 00200046 dece6ef0 00000000
[ 1246.498674] df0e5000 00000000 df0f8014 df32f300 dfa98948 dec0c548 df37fc54 c034351b
[ 1246.498674] 00000010 dfabe6c0 dec0c548 00080000 df32f300 00000001 00200246 dfa98988
[ 1246.498674] Call Trace:
[ 1246.498674] [<c034351b>] ? bfq_set_request+0x1f5/0x291
[ 1246.498674] [<c0343326>] ? bfq_set_request+0x0/0x291
[ 1246.498674] [<c0333ffe>] ? elv_set_request+0x17/0x26
[ 1246.498674] [<c03365ad>] ? get_request+0x15e/0x1e7
[ 1246.498674] [<c0336af5>] ? get_request_wait+0x22/0xd8
[ 1246.498674] [<c04943b9>] ? dm_merge_bvec+0x88/0xb5
[ 1246.498674] [<c0336f31>] ? __make_request+0x25e/0x310
[ 1246.498674] [<c0494c02>] ? dm_request+0x137/0x150
[ 1246.498674] [<c0335ecf>] ? generic_make_request+0x1e9/0x21f
[ 1246.498674] [<c033708b>] ? submit_bio+0xa8/0xb1
[ 1246.498674] [<c0264e49>] ? get_page+0x8/0xe
[ 1246.498674] [<c0265157>] ? __lru_cache_add+0x27/0x43
[ 1246.498674] [<c029fea2>] ? mpage_end_io_read+0x0/0x70
[ 1246.498674] [<c029f453>] ? mpage_bio_submit+0x1c/0x21
[ 1246.498674] [<c029ffc3>] ? mpage_readpages+0xb1/0xbe
[ 1246.498674] [<c02c04d6>] ? ext3_readpages+0x0/0x16
[ 1246.498674] [<c02c04ea>] ? ext3_readpages+0x14/0x16
[ 1246.498674] [<c02c0f4a>] ? ext3_get_block+0x0/0xd4
[ 1246.498674] [<c02649ee>] ? __do_page_cache_readahead+0xde/0x15b
[ 1246.498674] [<c0264cab>] ? ondemand_readahead+0xf9/0x107
[ 1246.498674] [<c0264d1e>] ? page_cache_sync_readahead+0x16/0x1c
[ 1246.498674] [<c02600b2>] ? generic_file_aio_read+0x1ad/0x463
[ 1246.498674] [<c02811cb>] ? do_sync_read+0xab/0xe9
[ 1246.498674] [<c0235fe4>] ? autoremove_wake_function+0x0/0x33
[ 1246.498674] [<c0268f15>] ? __inc_zone_page_state+0x12/0x15
[ 1246.498674] [<c026c1a9>] ? handle_mm_fault+0x5a0/0x5b5
[ 1246.498674] [<c0314bcc>] ? security_file_permission+0xf/0x11
[ 1246.498674] [<c0281949>] ? vfs_read+0x80/0xda
[ 1246.498674] [<c0281120>] ? do_sync_read+0x0/0xe9
[ 1246.498674] [<c0281bab>] ? sys_read+0x3b/0x5d
[ 1246.498674] [<c0203a3d>] ? sysenter_do_call+0x12/0x21
[ 1246.498674] Code: 55 e4 8b 55 d0 89 f0 e8 72 ea ff ff 85 c0 74 04 0f 0b eb fe 8d 46 10 89 45 e0 e8 57 a5 28 00 89 45 dc 8b 55 d0 8d 83 ac 00 00 00 <89> 15 bc 00 00 00 8d 56 14 e8 18 e9 ff ff 8b 75 d0 8d 93 b4 00
[ 1246.498674] EIP: [<c034210b>] __bfq_cic_change_cgroup+0x148/0x239 SS:ESP 0068:df37fbdc
[ 1246.498674] ---[ end trace 6bd1df99b7a9cb00 ]---
On Thu, Nov 20, 2008 at 1:15 PM, Vivek Goyal <[email protected]> wrote:
> On Thu, Nov 20, 2008 at 11:54:14AM -0800, Nauman Rafique wrote:
>> On Thu, Nov 20, 2008 at 5:40 AM, Vivek Goyal <[email protected]> wrote:
>> > On Thu, Nov 20, 2008 at 09:16:41AM +0100, Jens Axboe wrote:
>> >> On Wed, Nov 19 2008, Divyesh Shah wrote:
>> >> > On Wed, Nov 19, 2008 at 6:24 AM, Jens Axboe <[email protected]> wrote:
>> >> > > On Tue, Nov 18 2008, Nauman Rafique wrote:
>> >> > >> On Tue, Nov 18, 2008 at 11:12 AM, Jens Axboe <[email protected]> wrote:
>> >> > >> > On Tue, Nov 18 2008, Fabio Checconi wrote:
>> >> > >> >> > From: Vivek Goyal <[email protected]>
>> >> > >> >> > Date: Tue, Nov 18, 2008 09:07:51AM -0500
>> >> > >> >> >
>> >> > >> >> > On Tue, Nov 18, 2008 at 01:05:08PM +0100, Fabio Checconi wrote:
>> >> > >> >> ...
>> >> > >> >> > > I have to think a little bit on how it would be possible to support
>> >> > >> >> > > an option for time-only budgets, coexisting with the current behavior,
>> >> > >> >> > > but I think it can be done.
>> >> > >> >> > >
>> >> > >> >> >
>> >> > >> >> > IIUC, bfq and cfq are different in following manner.
>> >> > >> >> >
>> >> > >> >> > a. BFQ employs WF2Q+ for fairness and CFQ employes weighted round robin.
>> >> > >> >> > b. BFQ uses the budget (sector count) as notion of service and CFQ uses
>> >> > >> >> > time slices.
>> >> > >> >> > c. BFQ supports hierarchical fair queuing and CFQ does not.
>> >> > >> >> >
>> >> > >> >> > We are looking forward for implementation of point C. Fabio seems to
>> >> > >> >> > thinking of supporting time slice as a service (B). It seems like
>> >> > >> >> > convergence of CFQ and BFQ except the point A (WF2Q+ vs weighted round
>> >> > >> >> > robin).
>> >> > >> >> >
>> >> > >> >> > It looks like WF2Q+ provides tighter service bound and bfq guys mention
>> >> > >> >> > that they have been able to ensure throughput while ensuring tighter
>> >> > >> >> > bounds. If that's the case, does that mean BFQ is a replacement for CFQ
>> >> > >> >> > down the line?
>> >> > >> >> >
>> >> > >> >>
>> >> > >> >> BFQ started from CFQ, extending it in the way you correctly describe,
>> >> > >> >> so it is indeed very similar. There are also some minor changes to
>> >> > >> >> locking, cic handling, hw_tag detection and to the CIC_SEEKY heuristic.
>> >> > >> >>
>> >> > >> >> The two schedulers share similar goals, and in my opinion BFQ can be
>> >> > >> >> considered, in the long term, a CFQ replacement; *but* before talking
>> >> > >> >> about replacing CFQ we have to consider that:
>> >> > >> >>
>> >> > >> >> - it *needs* review and testing; we've done our best, but for sure
>> >> > >> >> it's not enough; review and testing are never enough;
>> >> > >> >> - the service domain fairness, which was one of our objectives, requires
>> >> > >> >> some extra complexity; the mechanisms we used and the design choices
>> >> > >> >> we've made may not fit all the needs, or may not be as generic as the
>> >> > >> >> simpler CFQ's ones;
>> >> > >> >> - CFQ has years of history behind and has been tuned for a wider
>> >> > >> >> variety of environments than the ones we've been able to test.
>> >> > >> >>
>> >> > >> >> If time-based fairness is considered more robust and the loss of
>> >> > >> >> service-domain fairness is not a problem, then the two schedulers can
>> >> > >> >> be made even more similar.
>> >> > >> >
>> >> > >> > My preferred approach here would be, in order or TODO:
>> >> > >> >
>> >> > >> > - Create and test the smallish patches for seekiness, hw_tag checking,
>> >> > >> > and so on for CFQ.
>> >> > >> > - Create and test a WF2Q+ service dispatching patch for CFQ.
>> >> > >> >
>> >> > >> > and if there are leftovers after that, we could even conditionally
>> >> > >> > enable some of those if appropriate. I think the WF2Q+ is quite cool and
>> >> > >> > could be easily usable as the default, so it's definitely a viable
>> >> > >> > alternative.
>> >> > >>
>> >> > >> 1 Merge BFQ into CFQ (Jens and Fabio). I am assuming that this would
>> >> > >> result in time slices being scheduled using WF2Q+
>> >> > >
>> >> > > Yep, at least that is my preference.
>> >> > >
>> >> > >> 2 Do the following to support proportional division:
>> >> > >> a) Expose the per device weight interface to user, instead of calculating
>> >> > >> from priority.
>> >> > >> b) Add support for scheduling bandwidth among a hierarchy of cgroups
>> >> > >> (besides threads)
>> >> > >> 3 Do the following to support the goals of 2 level schedulers:
>> >> > >> a) Limit the request descriptors allocated to each cgroup by adding
>> >> > >> functionality to elv_may_queue()
>> >> > >> b) Add support for putting an absolute limit on IO consumed by a
>> >> > >> cgroup. Such support is provided by Andrea
>> >> > >> Righi's patches too.
>> >> > >> c) Add support (configurable option) to keep track of total disk
>> >> > >> time/sectors/count
>> >> > >> consumed at each device, and factor that into scheduling decision
>> >> > >> (more discussion needed here)
>> >> > >> 6 Incorporate an IO tracking approach which can allow tracking cgroups
>> >> > >> for asynchronous reads/writes.
>> >> > >> 7 Start an offline email thread to keep track of progress on the above
>> >> > >> goals.
>> >> > >>
>> >> > >> Jens, what is your opinion everything beyond (1) in the above list?
>> >> > >>
>> >> > >> It would be great if work on (1) and (2)-(7) can happen in parallel so
>> >> > >> that we can see "proportional division of IO bandwidth to cgroups" in
>> >> > >> tree sooner than later.
>> >> > >
>> >> > > Sounds feasible, I'd like to see the cgroups approach get more traction.
>> >> > > My primary concern is just that I don't want to merge it into specific
>> >> > > IO schedulers.
>> >> >
>> >> > Jens,
>> >> > So are you saying you don't prefer cgroups based proportional IO
>> >> > division solutions in the IO scheduler but at a layer above so it can
>> >> > be shared with all IO schedulers?
>> >> >
>> >> > If yes, then in that case, what do you think about Vivek Goyal's
>> >> > patch or dm-ioband that achieve that. Of course, both solutions don't
>> >> > meet all the requirements in the list above, but we can work on that
>> >> > once we know which direction we should be heading in. In fact, it
>> >> > would help if you could express the reservations (if you have any)
>> >> > about these approaches. That would help in coming up with a plan that
>> >> > everyone agrees on.
>> >>
>> >> The dm approach has some merrits, the major one being that it'll fit
>> >> directly into existing setups that use dm and can be controlled with
>> >> familiar tools. That is a bonus. The draw back is partially the same -
>> >> it'll require dm. So it's still not a fit-all approach, unfortunately.
>> >>
>> >> So I'd prefer an approach that doesn't force you to use dm.
>> >
>> > Hi Jens,
>> >
>> > My patches met the goal of not using the dm for every device one wants
>> > to control.
>> >
>> > Having said that, few things come to mind.
>> >
>> > - In what cases do we need to control the higher level logical devices
>> > like dm. It looks like real contention for resources is at leaf nodes.
>> > Hence any kind of resource management/fair queueing should probably be
>> > done at leaf nodes and not at higher level logical nodes.
>> >
>> > If that makes sense, then probably we don't need to control dm device
>> > and we don't need such higher level solutions.
>> >
>> >
>> > - Any kind of 2 level scheduler solution has the potential to break the
>> > underlying IO scheduler. Higher level solution requires buffering of
>> > bios and controlled release of bios to lower layers. This control breaks
>> > the assumptions of lower layer IO scheduler which knows in what order
>> > bios should be dispatched to device to meet the semantics exported by
>> > the IO scheduler.
>> >
>> > - 2nd level scheduler does not keep track of tasks but task groups lets
>> > every group dispatch fair share. This has got little semantic problem in
>> > the sense that tasks and groups in root cgroup will not be considered at
>> > same level. "root" will be considered one group at same level with all
>> > child group hence competing with them for resources.
>> >
>> > This looks little odd. Considering tasks and groups same level kind of
>> > makes more sense. cpu scheduler also consideres tasks and groups at same
>> > level and deviation from that probably is not very good.
>> >
>> > Considering tasks and groups at same level will matter only if IO
>> > scheduler maintains separate queue for the task, like CFQ. Because
>> > in that case IO scheduler tries to provide fairness among various task
>> > queues. Some schedulers like noop don't have any notion of separate
>> > task queues and fairness among them. In that case probably we don't
>> > have a choice but to assume root group competing with child groups.
>> >
>> > Keeping above points in mind, probably two level scheduling is not a
>> > very good idea. If putting the code in a particular IO scheduler is a
>> > concern we can probably explore ways regarding how we can maximize the
>> > sharing of cgroup code among IO schedulers.
>> >
>> > Thanks
>> > Vivek
>> >
>>
>> It seems that we have a solution if we can figure out a way to share
>> cgroup code between different schedulers. I am thinking how other
>> schedulers (AS, Deadline, No-op) would use cgroups. Will they have
>> proportional division between requests from different cgroups? And use
>> their own policy (e.g deadline scheduling) within a cgroup? How about
>> if we have both threads and cgroups at a particular level? I think
>> putting all threads in a default cgroup seems like a reasonable choice
>> in this case.
>>
>> Here is a high level design that comes to mind.
>>
>> Put proportional division code and state in common code. Each level of
>> the hierarchy which has more than one cgroup would have some state
>> maintained in common code. At leaf level of hiearchy, we can have a
>> cgroup specific scheduler (created when a cgroup is created). We can
>> choose a different scheduler for each cgroup (we can have a no-op for
>> one cgroup while cfq for another).
>
> I am not sure that I understand the different scheduler for each cgroup
> aspect of it. What's the need? It makes things even more complicated I
> think.
With the design I had in mind, it seemed like that would come for
free. But if it does not, I completely agree with you that it's not as
important.
>
> But moving proportional division code out of particular scheduler and make
> it common makes sense.
>
> Looking at BFQ, I was thinking that we can just keep large part of the
> code. This common code can think of everything as scheduling entity. This
> scheduling entity (SE) will be defined by underlying scheduler depending on
> how queue management is done by underlying scheduler. So for CFQ, at
> each level, an SE can be either task or group. For the schedulers which
> don't maintain separate queues for tasks, it will simply be group at all
> levels.
So the structure of the hierarchy would depend on the underlying scheduler?
>
> We probably can employ B-WFQ2+ to provide hierarchical fairness between
> secheduling entities of this tree. Common layer will do the scheduling of
> entities (without knowing what is contained inside) and underlying scheduler
> will take care of dispatching the requests from the scheduled entity.
> (It could be a task queue for CFQ or a group queue for other schedulers).
>
> The tricky part would be how to abstract it in a clean way. It should lead
> to reduced code in CFQ/BFQ because B-WFQ2+ logic will be put into a
> common layer (for large part).
How about this plan:
1 Start with CFQ patched with some BFQ-like patches (this is what we
will have if Jens takes some of Fabio's patches). This will have no
cgroup-related logic (correct me if I am wrong).
2 Implement the proportional scheduling logic for cgroups in the common
layer, without touching the code produced in step 1. That means we
will have WF2Q+ used for scheduling cgroup time slices in proportion
to weight in the common code. If CFQ (the step 1 output) is used as the
scheduler, WF2Q+ would be used there too, but to schedule time slices
(in proportion to priorities?) between different threads. The common code
logic will be completely oblivious to the actual scheduler used
(patched CFQ, Deadline, AS etc).
cgroup tracking has to be implemented as part of step 2. The good
thing is that step 2 can proceed independently of step 1, as the output
of step 1 will have the same interface as the existing CFQ scheduler.
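Just to make sure we mean the same thing by the WF2Q+ part of step 2,
here is a toy, userspace-only sketch of the selection rule as I
understand it (an entity is eligible when its virtual start time is <=
the system virtual time; among the eligible ones pick the smallest
virtual finish time). This is not code from BFQ and the virtual time
bookkeeping is simplified:

#include <stdio.h>

struct entity {
	const char *name;
	double weight;
	double vstart, vfinish;		/* vfinish = vstart + service/weight */
	int backlogged;
};

/* Pick the eligible entity (vstart <= vtime) with the smallest vfinish. */
static struct entity *wf2q_next(struct entity *e, int n, double vtime)
{
	struct entity *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (!e[i].backlogged || e[i].vstart > vtime)
			continue;			/* not eligible */
		if (!best || e[i].vfinish < best->vfinish)
			best = &e[i];
	}
	return best;
}

int main(void)
{
	double slice = 100.0;			/* same slice/budget for all */
	double total_weight = 3.0;
	double vtime = 0.0;
	struct entity e[2] = {
		{ "A", 2.0, 0.0, 0.0, 1 },
		{ "B", 1.0, 0.0, 0.0, 1 },
	};
	int i, round;

	for (i = 0; i < 2; i++)
		e[i].vfinish = e[i].vstart + slice / e[i].weight;

	for (round = 0; round < 6; round++) {
		struct entity *next = wf2q_next(e, 2, vtime);

		if (!next)
			break;
		printf("round %d: serve %s\n", round, next->name);
		/* charge one slice to the served entity and requeue it */
		next->vstart = next->vfinish;
		next->vfinish = next->vstart + slice / next->weight;
		/* simplified: system virtual time advances by the service
		 * delivered divided by the total backlogged weight */
		vtime += slice / total_weight;
	}
	return 0;	/* prints A B A A B A, i.e. 2:1 as the weights say */
}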
>
>>
>> The scheduler gets a callback (just like it does right now from
>> driver) when the common code schedules a time slice (or budget) from
>> that cgroup. No buffering/queuing is done in the common code, so it
>> will still be a single level scheduler. When a request arrives, it is
>> routed to its scheduler's queues based on its cgroup.
>>
>> The common code proportional scheduler uses specified weights to
>> schedule time slices. Please let me know if it makes any sense. And
>> then we can start talking about lower level details.
>
> We can use either time slices or budgets (may be configurable) depending
> on which gives better results.
>
> Thanks
> Vivek
>
> From: Vivek Goyal <[email protected]>
> Date: Thu, Nov 20, 2008 04:31:55PM -0500
>
...
> Hi Fabio,
>
> I though will give bfq a try. I get following when I put my current shell
> into a newly created cgroup and then try to do "ls".
>
The posted patch cannot work as it is, I'm sorry for that ugly bug.
Do you still have problems with this one applied?
---
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index efb03fc..ed8c597 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -168,7 +168,7 @@ static void bfq_group_chain_link(struct bfq_data *bfqd, struct cgroup *cgroup,
spin_lock_irqsave(&bgrp->lock, flags);
- rcu_assign_pointer(bfqg->bfqd, bfqd);
+ rcu_assign_pointer(leaf->bfqd, bfqd);
hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
On Fri, Nov 21, 2008 at 04:05:33AM +0100, Fabio Checconi wrote:
> > From: Vivek Goyal <[email protected]>
> > Date: Thu, Nov 20, 2008 04:31:55PM -0500
> >
> ...
> > Hi Fabio,
> >
> > I though will give bfq a try. I get following when I put my current shell
> > into a newly created cgroup and then try to do "ls".
> >
>
> The posted patch cannot work as it is, I'm sorry for that ugly bug.
> Do you still have problems with this one applied?
>
> ---
> diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> index efb03fc..ed8c597 100644
> --- a/block/bfq-cgroup.c
> +++ b/block/bfq-cgroup.c
> @@ -168,7 +168,7 @@ static void bfq_group_chain_link(struct bfq_data *bfqd, struct cgroup *cgroup,
>
> spin_lock_irqsave(&bgrp->lock, flags);
>
> - rcu_assign_pointer(bfqg->bfqd, bfqd);
> + rcu_assign_pointer(leaf->bfqd, bfqd);
> hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
> hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
Thanks Fabio. This fix solves the issue for me.
I did some quick testing and I can see the differential service if I create
two cgroups of different priority. How do I map ioprio to shares? I
mean, let's say one cgroup has ioprio 4 and the other has ioprio 7; then
what's the respective share (%) of each cgroup?
Thanks
Vivek
> From: Vivek Goyal <[email protected]>
> Date: Fri, Nov 21, 2008 09:58:23AM -0500
>
> On Fri, Nov 21, 2008 at 04:05:33AM +0100, Fabio Checconi wrote:
> > > From: Vivek Goyal <[email protected]>
> > > Date: Thu, Nov 20, 2008 04:31:55PM -0500
> > >
> > ...
> > > Hi Fabio,
> > >
> > > I though will give bfq a try. I get following when I put my current shell
> > > into a newly created cgroup and then try to do "ls".
> > >
> >
> > The posted patch cannot work as it is, I'm sorry for that ugly bug.
> > Do you still have problems with this one applied?
> >
> > ---
> > diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
> > index efb03fc..ed8c597 100644
> > --- a/block/bfq-cgroup.c
> > +++ b/block/bfq-cgroup.c
> > @@ -168,7 +168,7 @@ static void bfq_group_chain_link(struct bfq_data *bfqd, struct cgroup *cgroup,
> >
> > spin_lock_irqsave(&bgrp->lock, flags);
> >
> > - rcu_assign_pointer(bfqg->bfqd, bfqd);
> > + rcu_assign_pointer(leaf->bfqd, bfqd);
> > hlist_add_head_rcu(&leaf->group_node, &bgrp->group_data);
> > hlist_add_head(&leaf->bfqd_node, &bfqd->group_list);
>
> Thanks Fabio. This fix solves the issue for me.
>
Ok thank you.
> I did a quick testing and I can see the differential service if I create
> two cgroups of different priority. How do I map ioprio to shares? I
> mean lets say one cgroup has ioprio 4 and other has got ioprio 7, then
> what's the respective share(%) of each cgroup?
>
I thought I wrote it somewhere, but maybe I missed that; weights are
mapped linearly, in decreasing order of priority:
weight = 8 - ioprio
[ the calculation is done in bfq_weight_t bfq_ioprio_to_weight() ]
So, with ioprio 4 you have weight 4, and with ioprio 7 you have weight 1.
The shares, as long as the two tasks/groups are active on the disk,
are 4/5 and 1/5 respectively.
This interface is really ugly, but it allows compatible uses of
ioprios with the two schedulers.
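Spelled out (the mapping below is just the weight = 8 - ioprio rule
above; the share arithmetic is the obvious one):

#include <stdio.h>

/* The linear mapping described above: weight = 8 - ioprio (ioprio 0..7). */
static unsigned int ioprio_to_weight(unsigned int ioprio)
{
	return 8 - ioprio;
}

int main(void)
{
	unsigned int wa = ioprio_to_weight(4);	/* -> weight 4 */
	unsigned int wb = ioprio_to_weight(7);	/* -> weight 1 */

	printf("shares: %.0f%% / %.0f%%\n",
	       100.0 * wa / (wa + wb), 100.0 * wb / (wa + wb));
	/* prints "shares: 80% / 20%", i.e. 4/5 and 1/5 */
	return 0;
}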
On Thu, Nov 20, 2008 at 02:42:38PM -0800, Nauman Rafique wrote:
[..]
> >> It seems that we have a solution if we can figure out a way to share
> >> cgroup code between different schedulers. I am thinking how other
> >> schedulers (AS, Deadline, No-op) would use cgroups. Will they have
> >> proportional division between requests from different cgroups? And use
> >> their own policy (e.g deadline scheduling) within a cgroup? How about
> >> if we have both threads and cgroups at a particular level? I think
> >> putting all threads in a default cgroup seems like a reasonable choice
> >> in this case.
> >>
> >> Here is a high level design that comes to mind.
> >>
> >> Put proportional division code and state in common code. Each level of
> >> the hierarchy which has more than one cgroup would have some state
> >> maintained in common code. At leaf level of hiearchy, we can have a
> >> cgroup specific scheduler (created when a cgroup is created). We can
> >> choose a different scheduler for each cgroup (we can have a no-op for
> >> one cgroup while cfq for another).
> >
> > I am not sure that I understand the different scheduler for each cgroup
> > aspect of it. What's the need? It makes things even more complicated I
> > think.
>
> With the design I had in my mind, it seemed like that would come for
> free. But if it does not, I completely agree with you that its not as
> important.
>
> >
> > But moving proportional division code out of particular scheduler and make
> > it common makes sense.
> >
> > Looking at BFQ, I was thinking that we can just keep large part of the
> > code. This common code can think of everything as scheduling entity. This
> > scheduling entity (SE) will be defined by underlying scheduler depending on
> > how queue management is done by underlying scheduler. So for CFQ, at
> > each level, an SE can be either task or group. For the schedulers which
> > don't maintain separate queues for tasks, it will simply be group at all
> > levels.
>
> So the structure of hierarchy would be dependent on the underlying scheduler?
>
Kind of. In fact it will depend both on the cgroup hierarchy and on the
underlying scheduler.
> >
> > We probably can employ B-WFQ2+ to provide hierarchical fairness between
> > secheduling entities of this tree. Common layer will do the scheduling of
> > entities (without knowing what is contained inside) and underlying scheduler
> > will take care of dispatching the requests from the scheduled entity.
> > (It could be a task queue for CFQ or a group queue for other schedulers).
> >
> > The tricky part would be how to abstract it in a clean way. It should lead
> > to reduced code in CFQ/BFQ because B-WFQ2+ logic will be put into a
> > common layer (for large part).
>
> How about this plan:
> 1 Start with CFQ patched with some BFQ like patches (This is what we
> will have if Jens takes some of Fabio's patches). This will have no
> cgroup related logic (correct me if I am wrong).
> 2 Repeat proportional scheduling logic for cgroups in the common
> layer, without touching the code produced in step 1. That means that
> we will have WF2Q+ used for scheduling cgroup time slices proportional
> to weight in the common code. If CFQ (step 1 output) is used as
> scheduler, WF2Q+ would be used there too, but to schedule time slices
> (in proportion to priorities?) between different threads. Common code
> logic will be completely oblivious of the actual scheduler used
> (patched CFQ, Deadline, AS etc).
I think once you start using WF2Q+ in the common layer, CFQ will have to
get rid of that code. (Remember that in the case of CFQ we will have a
tree which has both tasks and groups as scheduling entities.) So the
common layer code can select the next entity to be dispatched based on
WF2Q+, and then CFQ will decide which request to dispatch within that
scheduling entity.
So maybe we can start with bfq and try to break the code into two pieces,
one common and one scheduler-specific. Then try to make use of the common
code in deadline or anticipatory to see if things work fine. If that
works, then we can get CFQ to make use of the common code. By that time
CFQ should have Fabio's changes; I think that will include the WF2Q+
algorithm as well (at least to provide fairness among tasks, not the
hierarchical thing). Once the common layer WF2Q+ works well, we can get
rid of WF2Q+ from CFQ and complete the picture.
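On the dispatch path the boundary I have in mind would look roughly like
this (made-up helper names, only to show where the split of
responsibilities would be):

/*
 * Illustration only: the common layer picks the next scheduling entity,
 * the IO scheduler picks the request within it.
 */
static int elv_fq_dispatch(struct request_queue *q, int force)
{
	struct io_entity *entity;

	/* common layer: WF2Q+ decision over the (hierarchical) entity tree */
	entity = io_fq_select_entity(q);
	if (!entity)
		return 0;

	/* IO scheduler (CFQ/deadline/AS/noop): pick a request from that
	 * entity's queue and move it to the dispatch list */
	return io_sched_dispatch_entity(q, entity, force);
}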
> cgroup tracking has to be implemented as part of step 2. The good
> thing is that step 2 can proceed independent of step 1, as the output
> of step 1 will have the same interface as the existing CFQ scheduler.
>
Agreed. Any kind of tracking based on the bio and not the task context will
have to be done later, once we have come up with the common layer code.
These are very vague high level ideas; the devil lies in the details. :-)
I will get started to see how feasible the common layer code idea is.
Hi Vivek,
> > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > more sense to the IO control at each individual IO scheduler.
> >
> > I don't want to stick to it. I'm considering implementing dm-ioband's
> > algorithm into the block I/O layer experimentally.
>
> Thanks Ryo. Implementing a control at block layer sounds like another
> 2 level scheduling. We will still have the issue of breaking underlying
> CFQ and other schedulers. How to plan to resolve that conflict.
I think there is no conflict with the I/O schedulers.
Could you explain the conflict to me?
> What do you think about the solution at IO scheduler level (like BFQ) or
> may be little above that where one can try some code sharing among IO
> schedulers?
I would like to support any type of block device, even if I/Os issued
to the underlying device don't go through an IO scheduler. Dm-ioband
can be used for devices such as the loop device.
Thanks,
Ryo Tsuruta
On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > > more sense to the IO control at each individual IO scheduler.
> > >
> > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > algorithm into the block I/O layer experimentally.
> >
> > Thanks Ryo. Implementing a control at block layer sounds like another
> > 2 level scheduling. We will still have the issue of breaking underlying
> > CFQ and other schedulers. How to plan to resolve that conflict.
>
> I think there is no conflict against I/O schedulers.
> Could you expain to me about the conflict?
Because we do the buffering at the higher level scheduler and mostly release
the buffered bios in FIFO order, it might break the underlying IO
schedulers. Generally it is the IO scheduler's decision in what order to
release buffered bios.
For example, say there is one task of io priority 0 in a cgroup and the rest
of the tasks are of io prio 7, all in the best effort class. If the lower
priority (7) tasks do a lot of IO, then due to the buffering there is
a chance that IO from the lower prio tasks is seen by CFQ first and IO from
the higher prio task is not seen by CFQ for quite some time, hence that task
does not get its fair share within the cgroup. Similar situations can
arise with RT tasks also.
>
> > What do you think about the solution at IO scheduler level (like BFQ) or
> > may be little above that where one can try some code sharing among IO
> > schedulers?
>
> I would like to support any type of block device even if I/Os issued
> to the underlying device doesn't go through IO scheduler. Dm-ioband
> can be made use of for the devices such as loop device.
>
What do you mean by IO issued to the underlying device not going through
an IO scheduler? A loop device will be associated with a file, and the IO
will ultimately go to the IO scheduler which is serving those file
blocks.
What's the use case scenario for doing IO control at the loop device?
Ultimately the resource contention will take place on the actual underlying
physical device where the file blocks are. Will doing the resource control
there not solve the issue for you?
Thanks
Vivek
On Tue, Nov 25, 2008 at 8:27 AM, Vivek Goyal <[email protected]> wrote:
> On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
>> Hi Vivek,
>>
>> > > > Ryo, do you still want to stick to two level scheduling? Given the problem
>> > > > of it breaking down underlying scheduler's assumptions, probably it makes
>> > > > more sense to the IO control at each individual IO scheduler.
>> > >
>> > > I don't want to stick to it. I'm considering implementing dm-ioband's
>> > > algorithm into the block I/O layer experimentally.
>> >
>> > Thanks Ryo. Implementing a control at block layer sounds like another
>> > 2 level scheduling. We will still have the issue of breaking underlying
>> > CFQ and other schedulers. How to plan to resolve that conflict.
>>
>> I think there is no conflict against I/O schedulers.
>> Could you expain to me about the conflict?
>
> Because we do the buffering at higher level scheduler and mostly release
> the buffered bios in the FIFO order, it might break the underlying IO
> schedulers. Generally it is the decision of IO scheduler to determine in
> what order to release buffered bios.
>
> For example, If there is one task of io priority 0 in a cgroup and rest of
> the tasks are of io prio 7. All the tasks belong to best effort class. If
> tasks of lower priority (7) do lot of IO, then due to buffering there is
> a chance that IO from lower prio tasks is seen by CFQ first and io from
> higher prio task is not seen by cfq for quite some time hence that task
> not getting it fair share with in the cgroup. Similiar situations can
> arise with RT tasks also.
Wouldn't even anticipation algorithms break if buffering is done at a
higher level? Our anticipation algorithms are tuned to model a task's
behavior. If IOs get buffered at a higher layer, all bets are off about
anticipation.
>
>>
>> > What do you think about the solution at IO scheduler level (like BFQ) or
>> > may be little above that where one can try some code sharing among IO
>> > schedulers?
>>
>> I would like to support any type of block device even if I/Os issued
>> to the underlying device doesn't go through IO scheduler. Dm-ioband
>> can be made use of for the devices such as loop device.
>>
>
> What do you mean by that IO issued to underlying device does not go
> through IO scheduler? loop device will be associated with a file and
> IO will ultimately go to the IO scheduler which is serving those file
> blocks?
>
> What's the use case scenario of doing IO control at loop device?
> Ultimately the resource contention will take place on actual underlying
> physical device where the file blocks are. Will doing the resource control
> there not solve the issue for you?
>
> Thanks
> Vivek
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Thu, 2008-11-20 at 08:40 -0500, Vivek Goyal wrote:
> > The dm approach has some merrits, the major one being that it'll fit
> > directly into existing setups that use dm and can be controlled with
> > familiar tools. That is a bonus. The draw back is partially the same -
> > it'll require dm. So it's still not a fit-all approach, unfortunately.
> >
> > So I'd prefer an approach that doesn't force you to use dm.
>
> Hi Jens,
>
> My patches met the goal of not using the dm for every device one wants
> to control.
>
> Having said that, few things come to mind.
>
> - In what cases do we need to control the higher level logical devices
> like dm. It looks like real contention for resources is at leaf nodes.
> Hence any kind of resource management/fair queueing should probably be
> done at leaf nodes and not at higher level logical nodes.
The problem with stacking devices is that we do not know how the IO
going through the leaf nodes contributes to the aggregate throughput
seen by the application/cgroup that generated it, which is what end
users care about.
The block device could be a plain old sata device, a loop device, a
stacking device, a SSD, you name it, but their topologies and the fact
that some of them do not even use an elevator should be transparent to
the user.
If you wanted to do resource management at the leaf nodes some kind of
topology information should be passed down to the elevators controlling
the underlying devices, which in turn would need to work cooperatively.
> If that makes sense, then probably we don't need to control dm device
> and we don't need such higher level solutions.
For the reasons stated above the two level scheduling approach seems
cleaner to me.
> - Any kind of 2 level scheduler solution has the potential to break the
> underlying IO scheduler. Higher level solution requires buffering of
> bios and controlled release of bios to lower layers. This control breaks
> the assumptions of lower layer IO scheduler which knows in what order
> bios should be dispatched to device to meet the semantics exported by
> the IO scheduler.
Please notice that such an IO controller would only get in the way
of the elevator in case of contention for the device. What is more,
depending on the workload it turns out that buffering at higher layers
on a per-cgroup or per-task basis, like dm-ioband does, may actually
increase the aggregate throughput (I think that the dm-ioband team
observed this behavior too). The reason seems to be that bios buffered
in such a way tend to be highly correlated and thus very likely to get
merged when released to the elevator.
> - 2nd level scheduler does not keep track of tasks but task groups lets
> every group dispatch fair share. This has got little semantic problem in
> the sense that tasks and groups in root cgroup will not be considered at
> same level. "root" will be considered one group at same level with all
> child group hence competing with them for resources.
>
> This looks little odd. Considering tasks and groups same level kind of
> makes more sense. cpu scheduler also consideres tasks and groups at same
> level and deviation from that probably is not very good.
>
> Considering tasks and groups at same level will matter only if IO
> scheduler maintains separate queue for the task, like CFQ. Because
> in that case IO scheduler tries to provide fairness among various task
> queues. Some schedulers like noop don't have any notion of separate
> task queues and fairness among them. In that case probably we don't
> have a choice but to assume root group competing with child groups.
If deemed necessary this case could be handled too, but it does not look
like a show-stopper.
> Keeping above points in mind, probably two level scheduling is not a
> very good idea. If putting the code in a particular IO scheduler is a
> concern we can probably explore ways regarding how we can maximize the
> sharing of cgroup code among IO schedulers.
As discussed above, I still think that the two level scheduling approach
makes more sense. Regarding the sharing of cgroup code among IO
schedulers I am all for it. If we consider that elevators should only
care about maximizing usage of the underlying devices, implementing
other non-hardware-dependent scheduling disciplines (that prioritize
according to the task or cgroup that generated the IO, for example) at
higher layers so that we can reuse code makes a lot of sense.
Thanks,
Fernando
On Tue, 2008-11-25 at 11:27 -0500, Vivek Goyal wrote:
> On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > > > more sense to the IO control at each individual IO scheduler.
> > > >
> > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > algorithm into the block I/O layer experimentally.
> > >
> > > Thanks Ryo. Implementing a control at block layer sounds like another
> > > 2 level scheduling. We will still have the issue of breaking underlying
> > > CFQ and other schedulers. How to plan to resolve that conflict.
> >
> > I think there is no conflict against I/O schedulers.
> > Could you expain to me about the conflict?
>
> Because we do the buffering at higher level scheduler and mostly release
> the buffered bios in the FIFO order, it might break the underlying IO
> schedulers. Generally it is the decision of IO scheduler to determine in
> what order to release buffered bios.
It could be argued that the IO scheduler's primary goal is to maximize
usage of the underlying device according to its physical
characteristics. For hard disks this may imply minimizing time wasted by
seeks; other types of devices, such as SSDs, may impose different
requirements. This is something that clearly belongs in the elevator. On
the other hand, it could be argued that other non-hardware-related
scheduling disciplines would fit better in higher layers.
That said, as you pointed out such separation could impact performance,
so we will probably need to implement a feedback mechanism between the
elevator, which could collect statistics and provide hints, and the
upper layers. The elevator API looks like a good candidate for this,
though new functions might be needed.
> For example, If there is one task of io priority 0 in a cgroup and rest of
> the tasks are of io prio 7. All the tasks belong to best effort class. If
> tasks of lower priority (7) do lot of IO, then due to buffering there is
> a chance that IO from lower prio tasks is seen by CFQ first and io from
> higher prio task is not seen by cfq for quite some time hence that task
> not getting it fair share with in the cgroup. Similiar situations can
> arise with RT tasks also.
Well, this issue is not intrinsic to dm-ioband and similar solutions. In
the scenario you point out the problem is that the elevator and the IO
controller are not cooperating. The same could happen even if we
implemented everything at the elevator layer (or a little above): get
hierarchical scheduling wrong and you are likely to have a rough ride.
BFQ deals with hierarchical scheduling at just one layer, which makes
things easier. BFQ chose the elevator layer, but a similar scheduling
discipline could be implemented higher in the block layer too. We cannot
take the HW-specific bits out of the elevator, but when it comes to
task/cgroup based scheduling there are more possibilities, which
include the middle-way approach we are discussing: two level
scheduling.
The two level model is not bad per se, we just need to get the two
levels to work in unison and for that we will certainly need to make
changes to the existing elevators.
> > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > may be little above that where one can try some code sharing among IO
> > > schedulers?
> >
> > I would like to support any type of block device even if I/Os issued
> > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > can be made use of for the devices such as loop device.
>
> What do you mean by that IO issued to underlying device does not go
> through IO scheduler? loop device will be associated with a file and
> IO will ultimately go to the IO scheduler which is serving those file
> blocks?
I think that Tsuruta-san's point is that the loop device driver uses its
own make_request_fn, which means that bios entering a loop device do not
necessarily go through an IO scheduler after that.
We will always find ourselves in this situation when trying to manage
devices that provide their own make_request_fn, the reason being that
their behavior is driver and configuration dependent: in the loop device
case, whether we go through an IO scheduler or not depends on what has
been attached to it; in stacking device configurations, the effect that
the IO scheduling at one of the devices that constitute the multi-device
has on the aggregate throughput depends on the topology.
The only way I can think of to address all cases in a sane way is
controlling the entry point to the block layer, which is precisely what
dm-ioband does.
The problem with dm-ioband is that it relies on the dm infrastructure. In
my opinion, if we could remove that dependency it would be a huge step
in the right direction.
> What's the use case scenario of doing IO control at loop device?
My guess is virtualized machines using images exported as loop devices à
la Xen's blktap (blktap's implementation is quite different from Linux'
loop device, though).
Thanks,
Fernando
Hi Vivek,
From: Vivek Goyal <[email protected]>
Subject: Re: [patch 0/4] [RFC] Another proportional weight IO controller
Date: Tue, 25 Nov 2008 11:27:20 -0500
> On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > > > more sense to the IO control at each individual IO scheduler.
> > > >
> > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > algorithm into the block I/O layer experimentally.
> > >
> > > Thanks Ryo. Implementing a control at block layer sounds like another
> > > 2 level scheduling. We will still have the issue of breaking underlying
> > > CFQ and other schedulers. How to plan to resolve that conflict.
> >
> > I think there is no conflict against I/O schedulers.
> > Could you expain to me about the conflict?
>
> Because we do the buffering at higher level scheduler and mostly release
> the buffered bios in the FIFO order, it might break the underlying IO
> schedulers. Generally it is the decision of IO scheduler to determine in
> what order to release buffered bios.
>
> For example, If there is one task of io priority 0 in a cgroup and rest of
> the tasks are of io prio 7. All the tasks belong to best effort class. If
> tasks of lower priority (7) do lot of IO, then due to buffering there is
> a chance that IO from lower prio tasks is seen by CFQ first and io from
> higher prio task is not seen by cfq for quite some time hence that task
> not getting it fair share with in the cgroup. Similiar situations can
> arise with RT tasks also.
Thanks for your explanation.
I think that the same thing occurs without the higher level scheduler,
because all the tasks issuing I/Os are blocked while the underlying
device's request queue is full before those I/Os are sent to the I/O
scheduler.
> > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > may be little above that where one can try some code sharing among IO
> > > schedulers?
> >
> > I would like to support any type of block device even if I/Os issued
> > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > can be made use of for the devices such as loop device.
> >
>
> What do you mean by that IO issued to underlying device does not go
> through IO scheduler? loop device will be associated with a file and
> IO will ultimately go to the IO scheduler which is serving those file
> blocks?
How about if the file is on an NFS-mounted file system?
> What's the use case scenario of doing IO control at loop device?
> Ultimately the resource contention will take place on actual underlying
> physical device where the file blocks are. Will doing the resource control
> there not solve the issue for you?
I can't come up with a use case right now, but I would like to make the
resource controller more flexible. Actually, a certain block device
that I'm using does not use the I/O scheduler.
Thanks,
Ryo Tsuruta
Fabio and I are a little bit worried about the fact that the problem
of working in the time domain instead of the service domain is not
being properly dealt with. Probably we did not express ourselves very
clearly, so we will try to put it in more practical terms. Using B-WF2Q+
in the time domain instead of using CFQ (Round-Robin) means introducing
higher complexity than CFQ to get almost the same service properties
of CFQ. With regard to fairness (long term) B-WF2Q+ in the time domain
has exactly the same (un)fairness problems as CFQ. As far as bandwidth
differentiation is concerned, it can be obtained with CFQ by just
increasing the time slice (e.g., double weight => double slice). This
has no impact on long term guarantees and certainly does not decrease
the throughput.
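As a trivial illustration of the claim above (time slices proportional to
weight already give the desired long-term split under plain round robin;
this assumes nothing about CFQ internals):

#include <stdio.h>

/* Weighted round robin: each queue is served for weight * base_slice,
 * so the long-term bandwidth split simply follows the weights. */
int main(void)
{
	unsigned int weight[2] = { 2, 1 };	/* the "double weight" case */
	unsigned int base_slice_ms = 100;
	unsigned int total_ms = 0;
	int i;

	for (i = 0; i < 2; i++)
		total_ms += weight[i] * base_slice_ms;
	for (i = 0; i < 2; i++)
		printf("queue %d: slice %u ms, long-term share %.1f%%\n",
		       i, weight[i] * base_slice_ms,
		       100.0 * weight[i] * base_slice_ms / total_ms);
	return 0;	/* 200 ms -> 66.7%, 100 ms -> 33.3% */
}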
With regard to short term guarantees (request completion time), one of
the properties of the reference ideal system of WF2Q+ is that, assuming
for simplicity that all the queues have the same weight, as the ideal
system serves each queue at the same speed, shorter budgets are completed
in shorter time intervals than longer budgets. B-WF2Q+ guarantees
O(1) deviation from this ideal service. Hence, the tight delay/jitter
measured in our experiments with BFQ is a consequence of the simple (and
probably still improvable) budget assignment mechanism of (the overall)
BFQ. In contrast, if all the budgets are equal, as it happens if we use
time slices, the resulting scheduler is exactly a Round-Robin, again
as in CFQ (see [1]).
Finally, with regard to completion time delay differentiation through
weight differentiation, this is probably the only case in which B-WF2Q+
would perform better than CFQ, because, in case of CFQ, reducing the
time slices may reduce the throughput, whereas increasing the time slice
would increase the worst-case delay/jitter.
In the end, BFQ succeeds in guaranteeing fairness (or in general the
desired bandwidth distribution) because it works in the service domain
(and this is probably the only way to achieve this goal), not because
it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight
delay/jitter only because B-WF2Q+ is used in combination with a simple
budget assignment (differentiation) mechanism (again in the service
domain).
[1] http://feanor.sssup.it/~fabio/linux/bfq/results.php
--
-----------------------------------------------------------
| Paolo Valente | |
| Algogroup | |
| Dip. Ing. Informazione | tel: +39 059 2056318 |
| Via Vignolese 905/b | fax: +39 059 2056199 |
| 41100 Modena | |
| home: http://algo.ing.unimo.it/people/paolo/ |
-----------------------------------------------------------
On Wed, Nov 26, 2008 at 03:40:18PM +0900, Fernando Luis Vázquez Cao wrote:
> On Thu, 2008-11-20 at 08:40 -0500, Vivek Goyal wrote:
> > > The dm approach has some merrits, the major one being that it'll fit
> > > directly into existing setups that use dm and can be controlled with
> > > familiar tools. That is a bonus. The draw back is partially the same -
> > > it'll require dm. So it's still not a fit-all approach, unfortunately.
> > >
> > > So I'd prefer an approach that doesn't force you to use dm.
> >
> > Hi Jens,
> >
> > My patches met the goal of not using the dm for every device one wants
> > to control.
> >
> > Having said that, few things come to mind.
> >
> > - In what cases do we need to control the higher level logical devices
> > like dm. It looks like real contention for resources is at leaf nodes.
> > Hence any kind of resource management/fair queueing should probably be
> > done at leaf nodes and not at higher level logical nodes.
>
> The problem with stacking devices is that we do not know how the IO
> going through the leaf nodes contributes to the aggregate throughput
> seen by the application/cgroup that generated it, which is what end
> users care about.
>
If we keep track of the cgroup information in the bio and don't lose it
while the bio traverses the stack of devices, then the leaf node can still
do the proportional fair share allocation among contending cgroups on that
device.
I think end users care about getting their fair share if there is
contention anywhere along the IO path. The real contention is at the leaf
nodes. However complex the logical device topology is, if two applications
are not contending for the disk at the lowest level, there is no point in
doing any kind of resource management between them. Though the applications
might seemingly be contending for a higher level logical device, at the
leaf nodes their IOs might be going to different disks altogether and
practically there is no contention.
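To make the "tag the bio once, read it back at the leaf" idea concrete,
something along these lines (the field and helper names are made up for
illustration; IIRC the posted bio-cgroup patches actually derive the
owner from the page):

/*
 * Made-up field and helpers, purely illustrative: tag the bio with its
 * cgroup once at submission time and let stacked drivers (dm, md, loop)
 * carry the tag through when they clone or split the bio.
 */
struct io_cgroup;			/* per-cgroup state: weight, ... */

static void bio_set_io_cgroup(struct bio *bio)
{
	/* done once, in the context of the task submitting the IO;
	 * bi_io_cgroup is a hypothetical new field in struct bio */
	bio->bi_io_cgroup = task_to_io_cgroup(current);
}

static struct io_cgroup *bio_to_io_cgroup(struct bio *bio)
{
	/* called by the elevator at the leaf device: by now the bio may
	 * have been cloned/remapped several times, but the tag survives */
	return bio->bi_io_cgroup;
}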
> The block device could be a plain old sata device, a loop device, a
> stacking device, a SSD, you name it, but their topologies and the fact
> that some of them do not even use an elevator should be transparent to
> the user.
Are there some devices which don't use elevators at leaf nodes? If not,
then it's not an issue.
>
> If you wanted to do resource management at the leaf nodes some kind of
> topology information should be passed down to the elevators controlling
> the underlying devices, which in turn would need to work cooperatively.
>
I am not able to understand why some kind of topology information needs
to be passed to the underlying elevators. As long as the end device can map
a bio correctly to the right cgroup (irrespective of how complex the
topology is) and the end device steps into resource management only if
there is contention for resources among cgroups on that device, things are
fine. We don't have to worry about the complex intermediate topology.
I will take one hypothetical example. Let's assume there are two cgroups
A and B with weights 2048 and 1024 respectively. To me this information
means that if A and B really contend for the resources somewhere, then
make sure A gets 2/3 of the resources and B gets 1/3.
Now if tasks in these two groups happen to contend for the same disk at the
lowest level, we do resource management; otherwise we don't. Why do I need
to worry about intermediate logical devices in the IO path?
Maybe I am missing something. A detailed example will help here...
> > If that makes sense, then probably we don't need to control dm device
> > and we don't need such higher level solutions.
>
> For the reasons stated above the two level scheduling approach seems
> cleaner to me.
>
> > - Any kind of 2 level scheduler solution has the potential to break the
> > underlying IO scheduler. Higher level solution requires buffering of
> > bios and controlled release of bios to lower layers. This control breaks
> > the assumptions of lower layer IO scheduler which knows in what order
> > bios should be dispatched to device to meet the semantics exported by
> > the IO scheduler.
>
> Please notice that the such an IO controller would only get in the way
> of the elevator in case of contention for the device.
True. So are we saying that a user can get the expected CFQ or AS behavior
only if there is no contention? And if there is contention, then we don't
guarantee anything?
> What is more,
> depending on the workload it turns out that buffering at higher layers
> in a per-cgroup or per-task basis, like dm-band does, may actually
> increase the aggregate throughput (I think that the dm-band team
> observed this behavior too). The reason seems to be that bios buffered
> in such way tend to be highly correlated and thus very likely to get
> merged when released to the elevator.
The goal here is not to increase throughput by doing buffering at a higher
layer. That is what the IO scheduler already does: it buffers bios
and selects among them appropriately to boost throughput. If one needs to
focus on increasing throughput, it should be done at the IO scheduler level
and not by introducing one more buffering layer in between.
>
> > - 2nd level scheduler does not keep track of tasks but task groups lets
> > every group dispatch fair share. This has got little semantic problem in
> > the sense that tasks and groups in root cgroup will not be considered at
> > same level. "root" will be considered one group at same level with all
> > child group hence competing with them for resources.
> >
> > This looks little odd. Considering tasks and groups same level kind of
> > makes more sense. cpu scheduler also consideres tasks and groups at same
> > level and deviation from that probably is not very good.
> >
> > Considering tasks and groups at same level will matter only if IO
> > scheduler maintains separate queue for the task, like CFQ. Because
> > in that case IO scheduler tries to provide fairness among various task
> > queues. Some schedulers like noop don't have any notion of separate
> > task queues and fairness among them. In that case probably we don't
> > have a choice but to assume root group competing with child groups.
>
> If deemed necessary this case could be handled too, but it does not look
> like a show-stopper.
>
It is not a show stopper for sure. But it can be a genuine concern at least
in the case of CFQ, which tries to provide fairness among tasks.
Think of the following scenario (diagram taken from peterz's mail):
        root
       /  |  \
      1   2   A
             / \
            B   3
Assume that task 1, task 2 and group A belong to the best-effort class and
they all have the same priority. If we go for two-level scheduling, then
disk bandwidth will be divided in the ratio of 25%, 25% and 50% between
task 1, task 2 and group A.
I think it should instead be 33% each, which again comes back to the idea of
treating 1, 2 and A at the same level.
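To make the two splits explicit, a throwaway userspace sketch (assuming all
weights are equal and everybody is continuously backlogged):

#include <stdio.h>

int main(void)
{
	/*
	 * Two-level: the group scheduler only sees "root" (holding
	 * tasks 1 and 2) and group A, and splits the disk 50/50.
	 * CFQ then splits root's half between task 1 and task 2.
	 */
	printf("two-level: task1=%.0f%% task2=%.0f%% groupA=%.0f%%\n",
	       100.0 / 2 / 2, 100.0 / 2 / 2, 100.0 / 2);

	/*
	 * Flat: task 1, task 2 and group A compete at the same level,
	 * the way CFQ treats tasks today.
	 */
	printf("flat:      task1=%.0f%% task2=%.0f%% groupA=%.0f%%\n",
	       100.0 / 3, 100.0 / 3, 100.0 / 3);

	return 0;
}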
So this is not a show stopper, but once you go for one approach, switching
to another will become really hard, as it might require close interaction
with the underlying scheduler, and fundamentally a two-level scheduler will
find it very hard to communicate with the IO scheduler.
> > Keeping above points in mind, probably two level scheduling is not a
> > very good idea. If putting the code in a particular IO scheduler is a
> > concern we can probably explore ways regarding how we can maximize the
> > sharing of cgroup code among IO schedulers.
>
> As discussed above, I still think that the two level scheduling approach
> makes more sense.
IMHO, the two-level scheduling approach makes a case only if resource
management at the leaf nodes does not meet the requirements. So far we
have not got a concrete example where resource management at intermediate
logical devices is needed and resource management at the leaf nodes is not
sufficient.
Thanks
Vivek
> Regarding the sharing of cgroup code among IO
> schedulers I am all for it. If we consider that elevators should only
> care about maximizing usage of the underlying devices, implementing
> other non-hardware-dependent scheduling disciplines (that prioritize
> according to the task or cgroup that generated the IO, for example) at
> higher layers so that we can reuse code makes a lot of sense.
>
> Thanks,
>
> Fernando
On Wed, Nov 26, 2008 at 09:47:07PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> From: Vivek Goyal <[email protected]>
> Subject: Re: [patch 0/4] [RFC] Another proportional weight IO controller
> Date: Tue, 25 Nov 2008 11:27:20 -0500
>
> > On Tue, Nov 25, 2008 at 11:33:59AM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > >
> > > > > > Ryo, do you still want to stick to two level scheduling? Given the problem
> > > > > > of it breaking down underlying scheduler's assumptions, probably it makes
> > > > > > more sense to the IO control at each individual IO scheduler.
> > > > >
> > > > > I don't want to stick to it. I'm considering implementing dm-ioband's
> > > > > algorithm into the block I/O layer experimentally.
> > > >
> > > > Thanks Ryo. Implementing a control at block layer sounds like another
> > > > 2 level scheduling. We will still have the issue of breaking underlying
> > > > CFQ and other schedulers. How to plan to resolve that conflict.
> > >
> > > I think there is no conflict against I/O schedulers.
> > > Could you expain to me about the conflict?
> >
> > Because we do the buffering at higher level scheduler and mostly release
> > the buffered bios in the FIFO order, it might break the underlying IO
> > schedulers. Generally it is the decision of IO scheduler to determine in
> > what order to release buffered bios.
> >
> > For example, If there is one task of io priority 0 in a cgroup and rest of
> > the tasks are of io prio 7. All the tasks belong to best effort class. If
> > tasks of lower priority (7) do lot of IO, then due to buffering there is
> > a chance that IO from lower prio tasks is seen by CFQ first and io from
> > higher prio task is not seen by cfq for quite some time hence that task
> > not getting it fair share with in the cgroup. Similiar situations can
> > arise with RT tasks also.
>
> Thanks for your explanation.
> I think that the same thing occurs without the higher level scheduler,
> because all the tasks issuing I/Os are blocked while the underlying
> device's request queue is full before those I/Os are sent to the I/O
> scheduler.
>
True, and this issue was pointed out by Divyesh. I think we shall have to
fix this by allocating the request descriptors in proportion to each group's
share. One possible way is to make use of elv_may_queue() to determine
whether we can allocate further request descriptors or not.
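Something roughly along these lines is what I have in mind (only a sketch;
cgroup_io_weight(), total_io_weight() and cgroup_io_allocated() are made-up
helpers standing in for whatever per-cgroup accounting the controller would
keep, while the elevator_may_queue_fn hook itself already exists):

#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/sched.h>

/*
 * Rough sketch only: refuse new request descriptors to a cgroup once
 * it has consumed its proportional share of q->nr_requests.  The
 * cgroup_io_*() helpers below are hypothetical.
 */
static int prop_may_queue(struct request_queue *q, int rw)
{
	unsigned int weight = cgroup_io_weight(current);
	unsigned int total = total_io_weight(q);
	unsigned long quota = q->nr_requests * weight / total;

	if (cgroup_io_allocated(current, q) >= quota)
		return ELV_MQUEUE_NO;	/* caller waits for a free slot */

	return ELV_MQUEUE_MAY;
}

This would hook in through the elevator_may_queue_fn member of elevator_ops,
the same hook CFQ already implements via cfq_may_queue().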
> > > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > > may be little above that where one can try some code sharing among IO
> > > > schedulers?
> > >
> > > I would like to support any type of block device even if I/Os issued
> > > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > > can be made use of for the devices such as loop device.
> > >
> >
> > What do you mean by that IO issued to underlying device does not go
> > through IO scheduler? loop device will be associated with a file and
> > IO will ultimately go to the IO scheduler which is serving those file
> > blocks?
>
> How about if the files is on an NFS-mounted file system?
>
Interesting. So on the surface it looks like contention for the disk, but it
is really contention for the network and for the disk on the NFS server.
True that leaf-node IO control will not help here, as the IO is not going to
a leaf node at all. We can make the situation better by doing resource
control on network IO though.
> > What's the use case scenario of doing IO control at loop device?
> > Ultimately the resource contention will take place on actual underlying
> > physical device where the file blocks are. Will doing the resource control
> > there not solve the issue for you?
>
> I don't come up with any use case, but I would like to make the
> resource controller more flexible. Actually, a certain block device
> that I'm using does not use the I/O scheduler.
Isn't it equivalent to using No-op? If yes, then it should not be an
issue?
Thanks
Vivek
On Wed, Nov 26, 2008 at 6:06 AM, Paolo Valente <[email protected]> wrote:
> Fabio and I are a little bit worried about the fact that the problem
> of working in the time domain instead of the service domain is not
> being properly dealt with. Probably we did not express ourselves very
> clearly, so we will try to put in more practical terms. Using B-WF2Q+
> in the time domain instead of using CFQ (Round-Robin) means introducing
> higher complexity than CFQ to get almost the same service properties
> of CFQ. With regard to fairness (long term) B-WF2Q+ in the time domain
Are we talking about a case where all the contenders have equal
weights and are continuously backlogged? That seems to be the only
case when B-WF2Q+ would behave like Round-Robin. Am I missing
something here?
I can see that the only direct advantage of using WF2Q+ scheduling is
reduced jitter or latency in certain cases. But under heavy loads,
that might result in request latencies seen by RT threads being
reduced from a few seconds to a few milliseconds.
> has exactly the same (un)fairness problems of CFQ. As far as bandwidth
> differentiation is concerned, it can be obtained with CFQ by just
> increasing the time slice (e.g., double weight => double slice). This
> has no impact on long term guarantees and certainly does not decrease
> the throughput.
>
> With regard to short term guarantees (request completion time), one of
> the properties of the reference ideal system of Wf2Q+ is that, assuming
> for simplicity that all the queues have the same weight, as the ideal
> system serves each queue at the same speed, shorter budgets are completed
> in a shorter time intervals than longer budgets. B-WF2Q+ guarantees
> O(1) deviation from this ideal service. Hence, the tight delay/jitter
> measured in our experiments with BFQ is a consequence of the simple (and
> probably still improvable) budget assignment mechanism of (the overall)
> BFQ. In contrast, if all the budgets are equal, as it happens if we use
> time slices, the resulting scheduler is exactly a Round-Robin, again
> as in CFQ (see [1]).
Can the budget assignment mechanism of BFQ be converted to time slice
assignment mechanism? What I am trying to say here is that we can have
variable time slices, just like we have variable budgets.
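For example (numbers made up, just to show the idea), a slice could be scaled
by the weight exactly the way a budget would be:

#include <stdio.h>

/* Illustrative only: derive a time slice from a weight. */
static unsigned int weight_to_slice_ms(unsigned int weight)
{
	const unsigned int base_slice_ms = 100;	/* slice for weight 1024 */
	const unsigned int base_weight = 1024;

	return base_slice_ms * weight / base_weight;
}

int main(void)
{
	unsigned int w;

	for (w = 512; w <= 2048; w *= 2)
		printf("weight %4u -> slice %3u ms\n", w, weight_to_slice_ms(w));

	return 0;
}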
>
> Finally, with regard to completion time delay differentiation through
> weight differentiation, this is probably the only case in which B-WF2Q+
> would perform better than CFQ, because, in case of CFQ, reducing the
> time slices may reduce the throughput, whereas increasing the time slice
> would increase the worst-case delay/jitter.
>
> In the end, BFQ succeeds in guaranteeing fairness (or in general the
> desired bandwidth distribution) because it works in the service domain
> (and this is probably the only way to achieve this goal), not because
> it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight
> delay/jitter only because B-WF2Q+ is used in combination with a simple
> budget assignment (differentiation) mechanism (again in the service
> domain).
>
> [1] http://feanor.sssup.it/~fabio/linux/bfq/results.php
>
> --
> -----------------------------------------------------------
> | Paolo Valente | |
> | Algogroup | |
> | Dip. Ing. Informazione | tel: +39 059 2056318 |
> | Via Vignolese 905/b | fax: +39 059 2056199 |
> | 41100 Modena | |
> | home: http://algo.ing.unimo.it/people/paolo/ |
> -----------------------------------------------------------
>
>
> From: Nauman Rafique <[email protected]>
> Date: Wed, Nov 26, 2008 11:41:46AM -0800
>
> On Wed, Nov 26, 2008 at 6:06 AM, Paolo Valente <[email protected]> wrote:
> > Fabio and I are a little bit worried about the fact that the problem
> > of working in the time domain instead of the service domain is not
> > being properly dealt with. Probably we did not express ourselves very
> > clearly, so we will try to put in more practical terms. Using B-WF2Q+
> > in the time domain instead of using CFQ (Round-Robin) means introducing
> > higher complexity than CFQ to get almost the same service properties
> > of CFQ. With regard to fairness (long term) B-WF2Q+ in the time domain
>
> Are we talking about a case where all the contenders have equal
> weights and are continuously backlogged? That seems to be the only
> case when B-WF2Q+ would behave like Round-Robin. Am I missing
> something here?
>
It is the case with equal weights, but it is really a common one.
> I can see that the only direct advantage of using WF2Q+ scheduling is
> reduced jitter or latency in certain cases. But under heavy loads,
> that might result in request latencies seen by RT threads being
> reduced from a few seconds to a few milliseconds.
>
> > has exactly the same (un)fairness problems of CFQ. As far as bandwidth
> > differentiation is concerned, it can be obtained with CFQ by just
> > increasing the time slice (e.g., double weight => double slice). This
> > has no impact on long term guarantees and certainly does not decrease
> > the throughput.
> >
> > With regard to short term guarantees (request completion time), one of
> > the properties of the reference ideal system of Wf2Q+ is that, assuming
> > for simplicity that all the queues have the same weight, as the ideal
> > system serves each queue at the same speed, shorter budgets are completed
> > in a shorter time intervals than longer budgets. B-WF2Q+ guarantees
> > O(1) deviation from this ideal service. Hence, the tight delay/jitter
> > measured in our experiments with BFQ is a consequence of the simple (and
> > probably still improvable) budget assignment mechanism of (the overall)
> > BFQ. In contrast, if all the budgets are equal, as it happens if we use
> > time slices, the resulting scheduler is exactly a Round-Robin, again
> > as in CFQ (see [1]).
>
> Can the budget assignment mechanism of BFQ be converted to time slice
> assignment mechanism? What I am trying to say here is that we can have
> variable time slices, just like we have variable budgets.
>
Yes, it could be converted, and it would do in the time domain the
same differentiation it does now in the service domain. What we would
lose in the process is the fairness in the service domain. The service
properties/guarantees of the resulting scheduler would _not_ be the same
as the BFQ ones. Both long term and short term guarantees would be
affected by the unfairness caused by the different service rates
experienced by the scheduled entities.
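Here is a made-up numerical example of what we mean. Give two equally
weighted queues the same 100 ms slice, but let one of them work on a faster
(outer) zone of the disk:

#include <stdio.h>

int main(void)
{
	double slice_s = 0.1;		/* same 100 ms slice for both queues */
	double rate_outer = 90e6;	/* ~90 MB/s on an outer zone */
	double rate_inner = 45e6;	/* ~45 MB/s on an inner zone */

	printf("outer-zone queue: %.1f MB per slice\n",
	       rate_outer * slice_s / 1e6);
	printf("inner-zone queue: %.1f MB per slice\n",
	       rate_inner * slice_s / 1e6);

	/*
	 * Equal time, 2:1 service: fair in the time domain, unfair in
	 * the service domain.  This is the guarantee that would be lost.
	 */
	return 0;
}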
> >
> > Finally, with regard to completion time delay differentiation through
> > weight differentiation, this is probably the only case in which B-WF2Q+
> > would perform better than CFQ, because, in case of CFQ, reducing the
> > time slices may reduce the throughput, whereas increasing the time slice
> > would increase the worst-case delay/jitter.
> >
> > In the end, BFQ succeeds in guaranteeing fairness (or in general the
> > desired bandwidth distribution) because it works in the service domain
> > (and this is probably the only way to achieve this goal), not because
> > it uses WF2Q+ instead of Round-Robin. Similarly, it provides tight
> > delay/jitter only because B-WF2Q+ is used in combination with a simple
> > budget assignment (differentiation) mechanism (again in the service
> > domain).
> >
> > [1] http://feanor.sssup.it/~fabio/linux/bfq/results.php
> >
> > --
> > -----------------------------------------------------------
> > | Paolo Valente | |
> > | Algogroup | |
> > | Dip. Ing. Informazione | tel: +39 059 2056318 |
> > | Via Vignolese 905/b | fax: +39 059 2056199 |
> > | 41100 Modena | |
> > | home: http://algo.ing.unimo.it/people/paolo/ |
> > -----------------------------------------------------------
> >
> >
On Wed, 2008-11-26 at 11:08 -0500, Vivek Goyal wrote:
> > > > > What do you think about the solution at IO scheduler level (like BFQ) or
> > > > > may be little above that where one can try some code sharing among IO
> > > > > schedulers?
> > > >
> > > > I would like to support any type of block device even if I/Os issued
> > > > to the underlying device doesn't go through IO scheduler. Dm-ioband
> > > > can be made use of for the devices such as loop device.
> > > >
> > >
> > > What do you mean by that IO issued to underlying device does not go
> > > through IO scheduler? loop device will be associated with a file and
> > > IO will ultimately go to the IO scheduler which is serving those file
> > > blocks?
> >
> > How about if the files is on an NFS-mounted file system?
> >
>
> Interesting. So on the surface it looks like contention for the disk, but it
> is really contention for the network and for the disk on the NFS server.
>
> True that leaf-node IO control will not help here, as the IO is not going to
> a leaf node at all. We can make the situation better by doing resource
> control on network IO though.
On the client side NFS does not go through the block layer, so no control
is possible there. As Vivek pointed out, this could be tackled at the
network layer. Though I guess we could make do with a solution that
controls just the number of dirty pages (this would work for NFS writes,
since the NFS superblock has a backing_dev_info structure associated
with it).
> > > What's the use case scenario of doing IO control at loop device?
> > > Ultimately the resource contention will take place on actual underlying
> > > physical device where the file blocks are. Will doing the resource control
> > > there not solve the issue for you?
> >
> > I don't come up with any use case, but I would like to make the
> > resource controller more flexible. Actually, a certain block device
> > that I'm using does not use the I/O scheduler.
>
> Isn't it equivalent to using No-op? If yes, then it should not be an
> issue?
No, it is not equivalent. When using device drivers that provide their
own make_request_fn() (check for drivers that invoke
blk_queue_make_request() at initialization time), bios entering the block
layer can go directly to the device driver and from there to the device.
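A skeleton of such a driver, just as an illustration (error handling and
teardown omitted, and the bio is completed immediately to keep it short):

#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/bio.h>

static struct request_queue *my_queue;

/* bios handed to this function never visit an elevator */
static int my_make_request(struct request_queue *q, struct bio *bio)
{
	/* drive the hardware (or remap the bio) directly here */
	bio_endio(bio, 0);
	return 0;
}

static int __init my_driver_init(void)
{
	my_queue = blk_alloc_queue(GFP_KERNEL);
	if (!my_queue)
		return -ENOMEM;

	blk_queue_make_request(my_queue, my_make_request);
	return 0;
}
module_init(my_driver_init);
MODULE_LICENSE("GPL");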
Regards,
Fernando
Hi,
> > > I don't come up with any use case, but I would like to make the
> > > resource controller more flexible. Actually, a certain block device
> > > that I'm using does not use the I/O scheduler.
> >
> > Isn't it equivalent to using No-op? If yes, then it should not be an
> > issue?
>
> No, it is not equivalent. When using device drivers that provide their
> own make_request_fn() (check for drivers that invoke
> blk_queue_make_request() at initialization time), bios entering the block
> layer can go directly to the device driver and from there to the device.
As Fernando said, that device driver invokes blk_queue_make_request(), so
its I/Os do not go through the I/O scheduler.
Thanks,
Ryo Tsuruta
Hi Vivek,
> > Thanks for your explanation.
> > I think that the same thing occurs without the higher level scheduler,
> > because all the tasks issuing I/Os are blocked while the underlying
> > device's request queue is full before those I/Os are sent to the I/O
> > scheduler.
> >
>
> True and this issue was pointed out by Divyesh. I think we shall have to
> fix this by allocating the request descriptors in proportion to their
> share. One possible way is to make use of elv_may_queue() to determine
> if we can allocate furhter request descriptors or not.
At first glance, elv_may_queue() seemed to be useful for the purpose
you mentioned, but I noticed some problems after investigating the code
further.
1. Every I/O controller must have its own algorithm to decide which
I/O requests should be blocked or not, and that algorithm will be
similar to that of dm-ioband. It would be a hassle to implement it
in all the I/O controllers.
2. When an I/O completes, one of the slots in the request queue
becomes available, and then one of the blocked processes is awakened
in FIFO order. In most cases this will not be the best one, so
you have to make that process sleep again and wake up another one.
It's inefficient.
3. In elv_may_queue(), we can't determine which process issued an I/O.
You have no choice but to make any kind of process sleep, even if
it's a kernel thread such as kswapd or pdflush. What do you think
is going to happen after that?
It may be possible to modify the code not to block kernel threads,
but I don't think you can control delayed-write I/Os that way.
If you want to solve these problems, I think you will end up implementing
an algorithm there whose code is very similar to that of dm-ioband.
Thanks,
Ryo Tsuruta