On Tue, Jan 20, 2009 at 03:53:34PM +0000, Alasdair G Kergon wrote:
> So, what needs to be reviewed?
>
>
> 1. General style/layout/naming cleanup.
> - It's pretty good compared to a lot of patches that get sent, but there are
> still a few things we can improve.
>
> Lindent is throwing up some stuff (but remember it doesn't get things
> perfect so don't make all the changes it recommends).
> Remove double blank lines, plenty of unnecessary braces, "unsigned int" ->
> "unsigned".
> Review the names - should a few more things get a DM_ or dm_ prefix?
> Are all the names consistent and as self-explanatory as they can reasonably be?
> (e.g. a variable called 'new' - new what?)
>
>
> 2. A high-level review.
> Review the documentation and what the code does and how it does it.
> - Does it make sense to add this to the kernel in this way?
>
CCing lkml, containers mailing list and other folks who might be
interested in the thread.
It is a long mail. You have been warned. :-)
Here are some of my thoughts and also a summary of some of the past
discussions about dm-ioband, either on lkml or off lkml. Following is
one of the relevant links to a past discussion on lkml about this.
http://lkml.org/lkml/2008/11/6/227
At this point in time it looks like there are two schools of thought regarding
how an IO controller should be implemented. The first one (dm-ioband) believes
that the IO controller should be implemented as a two-level approach, where the
higher level of IO control is done by the dm-ioband driver and the lower level
of scheduling is done by the elevator/iosched code (noop, deadline, AS and cfq).
The second school of thought (me, Nauman from Google and maybe others) believes
that introducing another level of IO control at a higher layer breaks the
assumptions of the lower-level scheduling, hence we should be doing IO control
and IO scheduling in a single layer, and that is the elevator layer.
Before I dive into the details of how the assumptions are broken, let me
discuss the requirements a bit. We seem to be differing on the requirements
as well. I think that we need IO control only at the point where the real
contention is, and not on every logical block device where there is no real
contention for the resource. The real contention for the resource is at the
end-node physical device, where the device is slow, and that is where the need
for some kind of resource control arises.
I am not very sure why the dm-ioband folks want to enable IO control on any
xyz block device, but in the past I got two responses.
1. Need to control end devices which don't have any elevator attached.
2. Need to do IO control for devices which are effectively network backed.
for example, an NFS mounted file loop mounted as a block device.
I don't fully understand the first requirement. Which are the device drivers
that don't use any of the standard IO schedulers? I am not aware of any such
in-kernel drivers, so I am assuming these are binary drivers. If that's the
case, then those binary drivers need to be modified to take advantage of the
IO control provided by the elevator layer.
Regarding the second requirement, I think this sounds more like a network
controller issue. Again, the real contention is at the network layer and not
at the logical block device.
So at this point in time my understanding is that the most common case for
IO resource control is at the end devices in the system, and it can be
handled by a single level of IO control and scheduling. Please correct
me if that's not the case from a requirements point of view.
Having said that, even if we really find genuine cases where we need to
control IO on any xyz block device, we should be able to come
up with a generic IO controller which can reuse some of the code from the
one-level controller. I am not against that, and I think a one-level IO
controller and a generic IO controller can probably co-exist. But there are a
few points which I find a little odd about dm-ioband.
Why a generic IO controller is not good for every case
======================================================
To my knowledge, there have been two generic controller implementations.
One is dm-ioband and the other is an RFC patch by me. Following is the link.
http://lkml.org/lkml/2008/11/6/227
The biggest issue with generic controllers is that they can buffer the
bios at a higher layer (once a cgroup is backed up) and then later release
those bios in FIFO manner. This can conflict with the underlying IO
scheduler's assumptions. The following example comes to mind.
- Say there is one task of IO priority 0 in a cgroup and the rest of the tasks
are of IO prio 7, and all the tasks belong to the best-effort class. If the
tasks of lower priority (7) do a lot of IO, then due to buffering there is a
chance that the IO from the lower-prio tasks is seen by CFQ first and the IO
from the higher-prio task is not seen by CFQ for quite some time, hence that
task does not get its fair share within the cgroup. A similar situation can
arise with RT tasks also.
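To make the effect concrete, here is a toy Python sketch (illustration only;
the names and numbers are made up, and this is neither dm-ioband nor CFQ
code). A single group-level FIFO buffer collects bios from seven prio-7 tasks
and one prio-0 task and releases them a few at a time; the elevator below does
not see any prio-0 IO until most of the prio-7 IO has already been dispatched,
which is exactly the inversion described above:

# Toy model of a group-level FIFO buffer sitting above a priority-aware
# IO scheduler. Illustration of the argument only, not real kernel code.
from collections import deque

class Bio:
    def __init__(self, task, prio):
        self.task, self.prio = task, prio

def simulate():
    buffered = deque()  # second-level (group) buffer, released in FIFO order
    # Seven prio-7 tasks each queue 10 bios before the single prio-0 task
    # submits its first bio (it was briefly idle).
    for t in range(7):
        for _ in range(10):
            buffered.append(Bio("low-%d" % t, prio=7))
    for _ in range(10):
        buffered.append(Bio("high", prio=0))
    # The group may release only 8 bios per tick to the elevator.
    dispatched = []
    tick = 0
    while buffered:
        tick += 1
        for _ in range(min(8, len(buffered))):
            dispatched.append((tick, buffered.popleft()))
    first_high = next(t for t, b in dispatched if b.prio == 0)
    low_before = sum(1 for t, b in dispatched if b.prio == 7 and t < first_high)
    print("first prio-0 bio reaches the elevator at tick", first_high)
    print("prio-7 bios dispatched before that:", low_before)

if __name__ == "__main__":
    simulate()

Had CFQ seen these tasks directly, its per-task ioprio queues would have let
the prio-0 bios jump ahead; the FIFO buffer above it hides that information.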
Some of the issues with dm-ioband implementation
===============================================
- Breaks the assumptions of the underlying IO schedulers.
- There is no notion of task classes, so tasks of all classes are at the
same level from the resource-contention point of view. The only thing
which differentiates them is the cgroup weight, which does not address
the requirement that an RT task or RT cgroup should be able to starve a
peer cgroup if need be, since an RT cgroup should get priority access.
- Because of the FIFO release of buffered bios, it is possible that a task
of lower priority gets more IO done than a task of higher priority.
- Task grouping logic
- We already have the notion of cgroups, where tasks can be grouped
in a hierarchical manner. dm-ioband does not make full use of that and
comes up with its own mechanism of grouping tasks (apart from cgroups).
And there are odd ways of specifying the cgroup id while configuring the
dm-ioband device. I think once somebody has created the cgroup
hierarchy, any IO controller logic should be able to internally
read that hierarchy and provide control. There should be no need
for any other configuration utility on top of cgroups.
My RFC patches had done that.
- Need for a dm device for every device we want to control
- This requirement looks odd. It forces everybody to use dm-tools,
and if there are lots of disks in the system, configuration is
a pain.
- Does it support hierarchical grouping?
- I have not looked very closely at the dm-ioband patches regarding this and
had asked Ryo a question about it (no response).
Ryo, does dm-ioband support hierarchical grouping configuration?
Summary
=======
- IMHO, for the common case we don't need a generic IO controller; by
implementing an IO controller at the elevator layer with close coupling
to the IO schedulers, we should be able to achieve the goal.
Currently there is work in progress (off the list) by me, Nauman, Fabio,
Paolo and others to implement a common IO control layer which can be
used by all four IO schedulers without too much code duplication.
Hopefully in the next 2-3 weeks we should be able to post the initial patches
for RFC.
- Even if there are cases for controlling an xyz block device, we can have
a generic IO controller as well to cover that case. Ideally this controller
should not be used by devices which use the standard IO schedulers.
IMHO, dm-ioband has a few odd points, as mentioned above, when it comes to a
generic controller, and I think those should be addressed if we can really
justify the need for a generic IO controller.
Your comments are welcome.
Thanks
Vivek
Hi Vivek,
Thanks for your comments.
> I am not very sure why the dm-ioband folks want to enable IO control on any
> xyz block device, but in the past I got two responses.
>
> 1. Need to control end devices which don't have any elevator attached.
> 2. Need to do IO control for devices which are effectively network backed.
> for example, an NFS mounted file loop mounted as a block device.
The two responses are issues with IO scheduler based controllers, not
reasons why we implement the IO controller as a device-mapper driver.
The reasons for that are:
- A user has a choice whether to use dm-ioband or not, and dm-ioband
doesn't have any effect on the system if the user doesn't want to
use it.
- The dm device is a highly independent module, so we don't need to modify
the existing kernel code, including the IO schedulers. It keeps
the IO scheduler implementation simple.
So, dm-ioband can co-exist with any other IO controllers from both a
user's and a kernel developer's perspective.
> Why a generic IO controller is not good for every case
> ======================================================
> To my knowledge, there have been two generic controller implementations.
> One is dm-ioband and the other is an RFC patch by me. Following is the link.
>
> http://lkml.org/lkml/2008/11/6/227
>
> The biggest issue with generic controllers is that they can buffer the
> bios at a higher layer (once a cgroup is backed up) and then later release
> those bios in FIFO manner. This can conflict with the underlying IO
> scheduler's assumptions. The following example comes to mind.
I don't think you are completely right.
> - Say there is one task of IO priority 0 in a cgroup and the rest of the tasks
> are of IO prio 7, and all the tasks belong to the best-effort class. If the
> tasks of lower priority (7) do a lot of IO, then due to buffering there is a
> chance that the IO from the lower-prio tasks is seen by CFQ first and the IO
> from the higher-prio task is not seen by CFQ for quite some time, hence that
> task does not get its fair share within the cgroup. A similar situation can
> arise with RT tasks also.
Whether using dm-ioband or not, if the tasks of IO priority 7 do a lot
of IO, then the device queue is going to be full and tasks which try
to issue IOs are blocked until the queue gets a slot. The IOs are
backlogged even if they are issued from the task of IO priority 0.
I don't understand why you think it's the biggest issue. The same
thing is going to happen without dm-ioband.
If I were you, I would create two cgroups, let the tasks of lower priority
belong to one cgroup and the tasks of higher priority belong to the other,
and give higher bandwidth to the cgroup to which the higher-priority
tasks belong. What do you think about this approach?
> - Task grouping logic
> - We already have the notion of cgroups, where tasks can be grouped
> in a hierarchical manner. dm-ioband does not make full use of that and
> comes up with its own mechanism of grouping tasks (apart from cgroups).
> And there are odd ways of specifying the cgroup id while configuring the
> dm-ioband device. I think once somebody has created the cgroup
> hierarchy, any IO controller logic should be able to internally
> read that hierarchy and provide control. There should be no need
> for any other configuration utility on top of cgroups.
>
> My RFC patches had done that.
Dm-ioband can, of course, work with the bio-cgroup mechanism, which groups
tasks in the manner of cgroups.
I already have a basic design to make dm-ioband support the cgroup
hierarchy. This work should start after the core code of bio-cgroup,
which helps trace each I/O request, is merged into the -mm tree.
And the reason dm-ioband uses a cgroup id to specify a cgroup is that
the current cgroup infrastructure lacks features to manage resources
placed in kernel modules.
> - Need for a dm device for every device we want to control
>
> - This requirement looks odd. It forces everybody to use dm-tools,
> and if there are lots of disks in the system, configuration is
> a pain.
I don't think it's such a pain. I think you are already using LVM devices on
your boxes. Setting up dm-ioband is the same as for LVM, and some
scripts or something similar will help you set them up.
And it is also possible for this algorithm to be implemented directly in
the block layer if this is really needed.
> - Does it support hierarchical grouping?
>
> - I have not looked very closely at the dm-ioband patches regarding this and
> had asked Ryo a question about it (no response).
>
> Ryo, does dm-ioband support hierarchical grouping configuration?
I'm sorry I missed your email with the question.
I already have a design plan for it and I will start to implement it
if there are a lot of requests for this. But I doubt this should be
implemented in the kernel, since it could be handled in user-land, such as
by a daemon program.
Thanks,
Ryo Tsuruta
On Fri, Jan 23, 2009 at 07:14:04PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Thanks for your comments.
>
> > I am not very sure why the dm-ioband folks want to enable IO control on any
> > xyz block device, but in the past I got two responses.
> >
> > 1. Need to control end devices which don't have any elevator attached.
> > 2. Need to do IO control for devices which are effectively network backed.
> > for example, an NFS mounted file loop mounted as a block device.
>
> The two responses are issues with IO scheduler based controllers, not
> reasons why we implement the IO controller as a device-mapper driver.
> The reasons for that are:
> - A user has a choice whether to use dm-ioband or not, and dm-ioband
> doesn't have any effect on the system if the user doesn't want to
> use it.
Even with an in-kernel solution, the cgroup code will be compiled out if the
user is not using the IO controller. Some code might still be present at run
time, but I don't think it will be any big run-time penalty.
> - The dm device is a highly independent module, so we don't need to modify
> the existing kernel code, including the IO schedulers. It keeps
> the IO scheduler implementation simple.
>
Agreed that the dm device is a highly independent module, but it does not
look like the right place to implement the IO controller.
I think with the introduction of cgroups, IO scheduling has now become
hierarchical scheduling. Previously it was flat scheduling where there
was only one level. Now there can be multiple levels and each level
can have groups and queues. I don't think that we can break the
hierarchical scheduling problem into two parts where the top-level part is
moved into a module. It is something like saying that let's break CPU group
scheduling out into a separate module and that it should not be part of the
kernel.
I think we need to implement this hierarchical IO scheduler in the kernel,
which can schedule groups as well as end-level IO queues (maintained by
cfq, deadline, AS, or noop).
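Roughly, the kind of single-layer selection I have in mind looks like this
(an illustrative Python sketch with invented names, not code from any of the
RFC patches): one scheduler picks a group by weight first and then picks a
per-task queue inside that group by priority, so group fairness and task
class/priority are decided together instead of being split across two layers:

# Illustrative sketch only: one scheduler that understands both groups and
# per-task queues. Names and structure are invented for this example.
from collections import deque

class TaskQueue:
    def __init__(self, name, prio):
        self.name, self.prio = name, prio
        self.bios = deque()

class Group:
    def __init__(self, name, weight):
        self.name, self.weight = name, weight
        self.vtime = 0.0   # virtual time; lower means the group is owed service
        self.queues = []

def pick_group(groups):
    # Weighted fair pick: the backlogged group with the smallest virtual time.
    backlogged = [g for g in groups if any(q.bios for q in g.queues)]
    return min(backlogged, key=lambda g: g.vtime) if backlogged else None

def pick_queue(group):
    # Within the chosen group, honour task priority (lower value = higher prio).
    ready = [q for q in group.queues if q.bios]
    return min(ready, key=lambda q: q.prio)

def dispatch_one(groups):
    group = pick_group(groups)
    if group is None:
        return None
    queue = pick_queue(group)
    bio = queue.bios.popleft()
    group.vtime += 1.0 / group.weight  # heavier groups accumulate vtime slower
    return group.name, queue.name, bio

if __name__ == "__main__":
    a, b = Group("grp-a", weight=3), Group("grp-b", weight=1)
    a.queues = [TaskQueue("a-rt", prio=0), TaskQueue("a-be", prio=4)]
    b.queues = [TaskQueue("b-be", prio=4)]
    for q in a.queues + b.queues:
        q.bios.extend(range(6))
    for _ in range(8):
        print(dispatch_one([a, b]))

A group-level layer that sits in a separate dm module and only sees anonymous
bios cannot make the per-queue decision done in pick_queue() above; that is
the point of keeping both levels in one scheduler.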
> So, dm-ioband can co-exist with any other IO controllers from both a
> user's and a kernel developer's perspective.
Just because the device-mapper framework allows one to implement an IO
controller in a separate module does not mean we should implement it there.
It will be difficult to take care of issues like configuration, breaking the
underlying IO scheduler's assumptions, the capability to treat tasks and
groups at the same level, etc.
>
> > Why a generic IO controller is not good for every case
> > ======================================================
> > To my knowledge, there have been two generic controller implementations.
> > One is dm-ioband and the other is an RFC patch by me. Following is the link.
> >
> > http://lkml.org/lkml/2008/11/6/227
> >
> > The biggest issue with generic controllers is that they can buffer the
> > bios at a higher layer (once a cgroup is backed up) and then later release
> > those bios in FIFO manner. This can conflict with the underlying IO
> > scheduler's assumptions. The following example comes to mind.
>
> I don't think you are completely right.
>
> > - Say there is one task of IO priority 0 in a cgroup and the rest of the tasks
> > are of IO prio 7, and all the tasks belong to the best-effort class. If the
> > tasks of lower priority (7) do a lot of IO, then due to buffering there is a
> > chance that the IO from the lower-prio tasks is seen by CFQ first and the IO
> > from the higher-prio task is not seen by CFQ for quite some time, hence that
> > task does not get its fair share within the cgroup. A similar situation can
> > arise with RT tasks also.
>
> Whether using dm-ioband or not, if the tasks of IO priority 7 do a lot
> of IO, then the device queue is going to be full and tasks which try
> to issue IOs are blocked until the queue gets a slot. The IOs are
> backlogged even if they are issued from the task of IO priority 0.
> I don't understand why you think it's the biggest issue. The same
> thing is going to happen without dm-ioband.
>
True, even the limited availability of request descriptors can be a
bottleneck and can lead to the same kind of issues, but my contention is
that you are aggravating the problem. Putting in a second layer can break the
IO scheduler's assumptions even before the underlying request queue is full.
So a second-level solution on top will increase the frequency of incidents
where a lower-priority task can run away with more work done than a
higher-priority task, because there are no separate queues for different
priority tasks and the release of buffered bios is FIFO.
Secondly, what happens to tasks of the RT class? dm-ioband does not have any
notion of handling an RT cgroup or RT tasks.
Thirdly, doing any kind of resource control at a higher level takes away the
capability to treat tasks and groups at the same level. I have had this
discussion in another offline thread as well, where you are copied. I think
it is a good idea to treat tasks and groups at the same level where possible
(it depends on whether the IO scheduler creates separate queues for tasks or
not; cfq does).
> If I were you, I would create two cgroups, let the tasks of lower priority
> belong to one cgroup and the tasks of higher priority belong to the other,
> and give higher bandwidth to the cgroup to which the higher-priority
> tasks belong. What do you think about this approach?
>
I think this is not practical. What we are effectively saying then is that
task priority does not have any meaning. If we want a service difference
between two tasks, we need to put them in separate cgroups, otherwise we
can't guarantee anything. If we need to put every task in a separate cgroup,
then why even have the notion of task priority?
> > - Task grouping logic
> > - We already have the notion of cgroups, where tasks can be grouped
> > in a hierarchical manner. dm-ioband does not make full use of that and
> > comes up with its own mechanism of grouping tasks (apart from cgroups).
> > And there are odd ways of specifying the cgroup id while configuring the
> > dm-ioband device. I think once somebody has created the cgroup
> > hierarchy, any IO controller logic should be able to internally
> > read that hierarchy and provide control. There should be no need
> > for any other configuration utility on top of cgroups.
> >
> > My RFC patches had done that.
>
> Dm-ioband can, of course, work with the bio-cgroup mechanism, which groups
> tasks in the manner of cgroups.
> I already have a basic design to make dm-ioband support the cgroup
> hierarchy. This work should start after the core code of bio-cgroup,
> which helps trace each I/O request, is merged into the -mm tree.
>
The bio-cgroup patches are fine because they provide us the capability to
map delayed writes to the right cgroup, and they can be used by any IO
controller.
> And the reason dm-ioband uses a cgroup id to specify a cgroup is that
> the current cgroup infrastructure lacks features to manage resources
> placed in kernel modules.
Can you elaborate on that, please? We have heard in the past that cgroups
do not give you enough flexibility, but we never got the details.
In this case, first you are forcing some functionality into a kernel
module and then coming up with tools for its configuration. I never
understood why you don't let the controller live inside the kernel and
interact directly with the cgroup subsystem, instead of first taking
the functionality out of the kernel into a module and then justifying
the case that now we need new ways of configuring that module because
the cgroup infrastructure is not sufficient.
>
> > - Need for a dm device for every device we want to control
> >
> > - This requirement looks odd. It forces everybody to use dm-tools,
> > and if there are lots of disks in the system, configuration is
> > a pain.
>
> I don't think it's such a pain. I think you are already using LVM devices on
> your boxes. Setting up dm-ioband is the same as for LVM, and some
> scripts or something similar will help you set them up.
>
Not everybody uses LVM. Balbir once asked: if there are thousands of
disks in the system, does that mean I need to create a dm-ioband device
for all of the disks?
> And it is also possible for this algorithm to be implemented directly in
> the block layer if this is really needed.
>
> > - Does it support hierarchical grouping?
> >
> > - I have not looked very closely at the dm-ioband patches regarding this and
> > had asked Ryo a question about it (no response).
> >
> > Ryo, does dm-ioband support hierarchical grouping configuration?
>
> I'm sorry I missed your email with the question.
> I already have a design plan for it and I will start to implement it
> if there are a lot of requests for this. But I doubt this should be
> implemented in the kernel, since it could be handled in user-land, such as
> by a daemon program.
>
We do need a hierarchical grouping facility for the IO controller as well.
I am not sure that I agree with the idea of implementing the IO controller
as a device-mapper driver just because the device-mapper framework allows it
to be implemented as a module and one can avoid putting code in the kernel.
At this point in time, IMHO, I don't think that the IO controller code living
inside the kernel is an issue. I would rather focus on the rest of the issues.
Thanks
Vivek
Hi Vivek,
I have split this mail thread into three topics:
o 2-Level IO scheduling
o Hierarchical grouping facility for the IO controller
o Implementing the IO controller as a dm driver
This mail is about 2-Level IO scheduling.
> Just because the device-mapper framework allows one to implement an IO
> controller in a separate module does not mean we should implement it there.
> It will be difficult to take care of issues like configuration, breaking the
> underlying IO scheduler's assumptions, the capability to treat tasks and
> groups at the same level, etc.
If you are satisfied with the low-accuracy bandwidth control done by an IO
scheduler, you don't need to use dm-ioband. If you want to use
dm-ioband with an IO scheduler, dm-ioband can work with any type of IO
scheduler; of course, dm-ioband can also work with the IO scheduler
which you are developing yourself.
> > > - Say there is one task of IO priority 0 in a cgroup and the rest of the tasks
> > > are of IO prio 7, and all the tasks belong to the best-effort class. If the
> > > tasks of lower priority (7) do a lot of IO, then due to buffering there is a
> > > chance that the IO from the lower-prio tasks is seen by CFQ first and the IO
> > > from the higher-prio task is not seen by CFQ for quite some time, hence that
> > > task does not get its fair share within the cgroup. A similar situation can
> > > arise with RT tasks also.
> >
> > Whether using dm-ioband or not, if the tasks of IO priority 7 do a lot
> > of IO, then the device queue is going to be full and tasks which try
> > to issue IOs are blocked until the queue gets a slot. The IOs are
> > backlogged even if they are issued from the task of IO priority 0.
> > I don't understand why you think it's the biggest issue. The same
> > thing is going to happen without dm-ioband.
> >
>
> True, even the limited availability of request descriptors can be a
> bottleneck and can lead to the same kind of issues, but my contention is
> that you are aggravating the problem. Putting in a second layer can break the
> IO scheduler's assumptions even before the underlying request queue is full.
I don't think so. Dm-ioband doesn't break the IO scheduler's assumptions.
In CFQ's case, the priority order is not changed within a cgroup.
> So a second-level solution on top will increase the frequency of incidents
> where a lower-priority task can run away with more work done than a
> higher-priority task, because there are no separate queues for different
> priority tasks and the release of buffered bios is FIFO.
>
> Secondly, what happens to tasks of the RT class? dm-ioband does not have any
> notion of handling an RT cgroup or RT tasks.
It's not an issue; it's a question of how to determine the policy.
I think giving priority to the cgroup policy rather than the I/O scheduler
policy is more flexible.
> Thirdly, doing any kind of resource control at a higher level takes away the
> capability to treat tasks and groups at the same level. I have had this
> discussion in another offline thread as well, where you are copied. I think
> it is a good idea to treat tasks and groups at the same level where possible
> (it depends on whether the IO scheduler creates separate queues for tasks or
> not; cfq does).
>
> > If I were you, I would create two cgroups, let the tasks of lower priority
> > belong to one cgroup and the tasks of higher priority belong to the other,
> > and give higher bandwidth to the cgroup to which the higher-priority
> > tasks belong. What do you think about this approach?
>
> I think this is not practical. What we are effectively saying then is that
> task priority does not have any meaning. If we want a service difference
> between two tasks, we need to put them in separate cgroups, otherwise we
> can't guarantee anything. If we need to put every task in a separate cgroup,
> then why even have the notion of task priority?
It is possible to modify dm-ioband to cooperate with CFQ, but I'm not
sure it's really meaningful. What do you do when a task of the RT class
issues a lot of I/O? Do you always give priority to the I/Os from the
RT-class task in spite of the assigned bandwidth? Which one do you
give priority to, bandwidth or the RT class?
Thanks,
Ryo Tsuruta
Hi Vivek,
This mail is about the hierarchical grouping facility for the IO controller.
> I think with the introduction of cgroups, IO scheduling has now become
> hierarchical scheduling. Previously it was flat scheduling where there
> was only one level. Now there can be multiple levels and each level
> can have groups and queues. I don't think that we can break the
> hierarchical scheduling problem into two parts where the top-level part is
> moved into a module. It is something like saying that let's break CPU group
> scheduling out into a separate module and that it should not be part of the
> kernel.
>
> I think we need to implement this hierarchical IO scheduler in the kernel,
> which can schedule groups as well as end-level IO queues (maintained by
> cfq, deadline, AS, or noop).
I can implement the hierarchical grouping facility if it is really necessary,
and the patch will be released after dm-ioband is merged into the
kernel. But do you believe the hierarchical grouping facility should
be implemented in the kernel, even if we can do it in userland?
I think disk bandwidth control doesn't need the kind of responsiveness the
CPU scheduler needs. I know it's possible to implement it in the kernel, but
doesn't it just make the kernel more complex?
> > > - Task grouping logic
> > > - We already have the notion of cgroups, where tasks can be grouped
> > > in a hierarchical manner. dm-ioband does not make full use of that and
> > > comes up with its own mechanism of grouping tasks (apart from cgroups).
> > > And there are odd ways of specifying the cgroup id while configuring the
> > > dm-ioband device. I think once somebody has created the cgroup
> > > hierarchy, any IO controller logic should be able to internally
> > > read that hierarchy and provide control. There should be no need
> > > for any other configuration utility on top of cgroups.
> > >
> > > My RFC patches had done that.
> >
> > Dm-ioband can, of course, work with the bio-cgroup mechanism, which groups
> > tasks in the manner of cgroups.
> > I already have a basic design to make dm-ioband support the cgroup
> > hierarchy. This work should start after the core code of bio-cgroup,
> > which helps trace each I/O request, is merged into the -mm tree.
>
> The bio-cgroup patches are fine because they provide us the capability to
> map delayed writes to the right cgroup, and they can be used by any IO
> controller.
>
> > And the reason dm-ioband uses a cgroup id to specify a cgroup is that
> > the current cgroup infrastructure lacks features to manage resources
> > placed in kernel modules.
>
> Can you elaborate on that, please? We have heard in the past that cgroups
> do not give you enough flexibility, but we never got the details.
The current cgroup framework can't manage resources dynamically. The
reason for using a cgroup ID to specify a cgroup is that it makes the
implementation quite simple. Although it is possible to implement the
function without using the ID, it makes the kernel complex, because
the function has to be implemented outside of the kernel.
Thanks,
Ryo Tsuruta
Hi Vivek,
This mail is about implementing the IO controller as a dm driver.
> In this case, first you are forcing some functionality into a kernel
> module and then coming up with tools for its configuration. I never
> understood why you don't let the controller live inside the kernel and
> interact directly with the cgroup subsystem, instead of first taking
> the functionality out of the kernel into a module and then justifying
> the case that now we need new ways of configuring that module because
> the cgroup infrastructure is not sufficient.
It is possible that the algorithm of dm-ioband could be implemented directly
in the kernel. I've been investigating how to do it.
> > > - Need for a dm device for every device we want to control
> > >
> > > - This requirement looks odd. It forces everybody to use dm-tools,
> > > and if there are lots of disks in the system, configuration is
> > > a pain.
> >
> > I don't think it's such a pain. I think you are already using LVM devices on
> > your boxes. Setting up dm-ioband is the same as for LVM, and some
> > scripts or something similar will help you set them up.
>
> Not everybody uses LVM. Balbir once asked: if there are thousands of
> disks in the system, does that mean I need to create a dm-ioband device
> for all of the disks?
I think it could easily be done with a small script of several lines.
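For example, something along these lines (an illustrative Python sketch; the
ioband table parameters below are placeholders only, and the real table syntax
should be taken from the dm-ioband documentation):

#!/usr/bin/env python3
# Illustrative helper: create one ioband dm device per disk given on the
# command line. The ioband target parameters are placeholders -- consult the
# dm-ioband documentation for the real table syntax before using this.
import subprocess
import sys

def disk_sectors(dev):
    # "blockdev --getsz" prints the device size in 512-byte sectors.
    return int(subprocess.check_output(["blockdev", "--getsz", dev]).strip())

def create_ioband(dev, name, weight):
    # Hypothetical table of the form "<start> <size> ioband <device> <params>".
    table = "0 %d ioband %s 1 0 0 none weight 0 :%d" % (
        disk_sectors(dev), dev, weight)
    # "dmsetup create <name>" reads the table from stdin.
    subprocess.run(["dmsetup", "create", name], input=table.encode(), check=True)

if __name__ == "__main__":
    # e.g. ./make-iobands.py /dev/sdb /dev/sdc /dev/sdd
    for i, dev in enumerate(sys.argv[1:]):
        create_ioband(dev, "ioband%d" % i, weight=100)

This only addresses the mechanical effort of setting up many devices, not the
question of whether the extra dm layer is desirable in the first place.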
Thanks,
Ryo Tsuruta