Subject: Bandwidth Allocations under CFQ I/O Scheduler

The I/O priority levels available under the CFQ scheduler are
nice (no pun intended), but I remember some talk back when
they first went in that future versions might include bandwidth
allocations in addition to the 'niceness' style. Is anyone out
there working on that? If not, I'm willing to hack up a proof
of concept... I just want to make sure I'm not reinventing
the wheel.

Thanks,

Thad Phetteplace


2006-10-17 01:24:38

by Arjan van de Ven

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Mon, 2006-10-16 at 16:46 -0400, Phetteplace, Thad (GE Healthcare,
consultant) wrote:
> The I/O priority levels available under the CFQ scheduler are
> nice (no pun intended), but I remember some talk back when
> they first went in that future versions might include bandwidth
> allocations in addition to the 'niceness' style. Is anyone out
> there working on that? If not, I'm willing to hack up a proof
> of concept... I just want to make sure I'm not reinventing
> the wheel.


Hi,

it's a nice idea in theory. However... since IO bandwidth for seeks is
about 1% to 3% of that of sequential IO (on disks at least), which
bandwidth do you want to allocate? "worst case" you need to use the
all-seeks bandwidth, but that's so far away from "best case" that it may
well not be relevant in practice. Yet there are real world cases where
for a period of time you approach worst case behavior ;(
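Arjan's 1% to 3% figure falls out of simple arithmetic. A minimal sketch, with all disk parameters assumed purely for illustration (roughly a 2006-era drive):

```python
# Why all-seek bandwidth is a tiny fraction of sequential bandwidth.
# All numbers below are illustrative assumptions, not measurements.

SEQ_BW = 60e6       # sequential throughput, bytes/sec (assumed)
AVG_SEEK = 0.008    # average seek + rotational latency, sec (assumed)
REQ_SIZE = 4096     # one random 4 KiB read per seek

def random_bw(seek_time, req_size, seq_bw):
    """Effective bandwidth when every request pays a full seek."""
    transfer_time = req_size / seq_bw
    return req_size / (seek_time + transfer_time)

rand = random_bw(AVG_SEEK, REQ_SIZE, SEQ_BW)
print(f"random: {rand/1e6:.2f} MB/s, {100 * rand / SEQ_BW:.1f}% of sequential")
```

With these numbers the all-seek case lands under 1% of sequential; larger request sizes push it into Arjan's 1-3% range.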

Greetings,
Arjan van de Ven
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2006-10-17 13:22:38

by Jens Axboe

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Tue, Oct 17 2006, Arjan van de Ven wrote:
> On Mon, 2006-10-16 at 16:46 -0400, Phetteplace, Thad (GE Healthcare,
> consultant) wrote:
> > The I/O priority levels available under the CFQ scheduler are
> > nice (no pun intended), but I remember some talk back when
> > they first went in that future versions might include bandwidth
> > allocations in addition to the 'niceness' style. Is anyone out
> > there working on that? If not, I'm willing to hack up a proof
> > of concept... I just want to make sure I'm not reinventing
> > the wheel.
>
>
> Hi,
>
> it's a nice idea in theory. However... since IO bandwidth for seeks is
> about 1% to 3% of that of sequential IO (on disks at least), which
> bandwidth do you want to allocate? "worst case" you need to use the
> all-seeks bandwidth, but that's so far away from "best case" that it may
> well not be relevant in practice. Yet there are real world cases where
> for a period of time you approach worst case behavior ;(

Bandwidth reservation would have to be confined to special cases, you
obviously cannot do it "in general" for the reasons Arjan lists above.
So you absolutely have to limit any meta data io that would cause seeks,
and the file in question would have to be laid out in a closely
sequential fashion. As long as the access pattern generated by the app
asking for reservation is largely sequential, the kernel can do whatever
it needs to help you maintain the required bandwidth.

On a per-file basis the bandwidth reservation should be doable, to the
extent that generic hardware allows.
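Jens's "closely sequential layout" precondition could in principle be checked mechanically. A hypothetical sketch: given a file's physical block list (obtainable via the root-only FIBMAP ioctl, not shown here), measure how sequential it is. The function names and the 95% threshold are invented for illustration:

```python
# Decide whether a file is laid out "closely sequentially" enough for a
# bandwidth reservation to be meaningful. Assumes the physical block
# numbers were already fetched (e.g. via FIBMAP); we only analyze them.

def contiguity(blocks):
    """Fraction of block-to-block transitions that are contiguous."""
    if len(blocks) < 2:
        return 1.0
    contig = sum(1 for a, b in zip(blocks, blocks[1:]) if b == a + 1)
    return contig / (len(blocks) - 1)

def reservable(blocks, threshold=0.95):
    """Only mostly-sequential files are candidates for a reservation."""
    return contiguity(blocks) >= threshold
```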

--
Jens Axboe

2006-10-17 14:37:57

by Ric Wheeler

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

Jens Axboe wrote:
> On Tue, Oct 17 2006, Arjan van de Ven wrote:
>
>>On Mon, 2006-10-16 at 16:46 -0400, Phetteplace, Thad (GE Healthcare,
>>consultant) wrote:
>>
>>>The I/O priority levels available under the CFQ scheduler are
>>>nice (no pun intended), but I remember some talk back when
>>>they first went in that future versions might include bandwidth
>>>allocations in addition to the 'niceness' style. Is anyone out
>>>there working on that? If not, I'm willing to hack up a proof
>>>of concept... I just want to make sure I'm not reinventing
>>>the wheel.
>>
>>
>>Hi,
>>
>>it's a nice idea in theory. However... since IO bandwidth for seeks is
>>about 1% to 3% of that of sequential IO (on disks at least), which
>>bandwidth do you want to allocate? "worst case" you need to use the
>>all-seeks bandwidth, but that's so far away from "best case" that it may
>>well not be relevant in practice. Yet there are real world cases where
>>for a period of time you approach worst case behavior ;(
>
>
> Bandwidth reservation would have to be confined to special cases, you
> obviously cannot do it "in general" for the reasons Arjan lists above.
> So you absolutely have to limit any meta data io that would cause seeks,
> and the file in question would have to be laid out in a closely
> sequential fashion. As long as the access pattern generated by the app
> asking for reservation is largely sequential, the kernel can do whatever
> it needs to help you maintain the required bandwidth.
>
> On a per-file basis the bandwidth reservation should be doable, to the
> extent that generic hardware allows.

I agree - bandwidth allocation is really tricky to do in a useful way.

On one hand, you could "time slice" the disk with some large quanta as
we would do with a CPU to get some reasonably useful allocation for
competing, streaming workloads.

On the other hand, this kind of thing would kill latency if/when you hit
any synchronous writes (or cold reads).

One other possible use for allocation is throttling a background
workload (say, an iterative checker for a file system or some such
thing) where the workload can run effectively forever, but should be
contained to not interfere with foreground workloads. A similar time
slice might be used to throttle this load down unless there is no
competing work to be done.
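The contained-background idea above can be sketched as a two-queue pick rule; this is an illustration of the policy only, not how any real scheduler is implemented:

```python
# Sketch of the throttling policy described above: the background
# scanner only gets the disk when no foreground request is pending,
# so it can run "forever" without interfering. Illustrative only.

def pick_next(fg_queue, bg_queue):
    """Pop and return the next request to dispatch, or None if idle."""
    if fg_queue:
        return fg_queue.pop(0)   # foreground always wins
    if bg_queue:
        return bg_queue.pop(0)   # background runs only on an idle disk
    return None
```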




2006-10-17 14:46:35

by Jens Axboe

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Tue, Oct 17 2006, Ric Wheeler wrote:
> Jens Axboe wrote:
> >On Tue, Oct 17 2006, Arjan van de Ven wrote:
> >
> >>On Mon, 2006-10-16 at 16:46 -0400, Phetteplace, Thad (GE Healthcare,
> >>consultant) wrote:
> >>
> >>>The I/O priority levels available under the CFQ scheduler are
> >>>nice (no pun intended), but I remember some talk back when
> >>>they first went in that future versions might include bandwidth
> >>>allocations in addition to the 'niceness' style. Is anyone out
> >>>there working on that? If not, I'm willing to hack up a proof
> >>>of concept... I just want to make sure I'm not reinventing
> >>>the wheel.
> >>
> >>
> >>Hi,
> >>
> >>it's a nice idea in theory. However... since IO bandwidth for seeks is
> >>about 1% to 3% of that of sequential IO (on disks at least), which
> >>bandwidth do you want to allocate? "worst case" you need to use the
> >>all-seeks bandwidth, but that's so far away from "best case" that it may
> >>well not be relevant in practice. Yet there are real world cases where
> >>for a period of time you approach worst case behavior ;(
> >
> >
> >Bandwidth reservation would have to be confined to special cases, you
> >obviously cannot do it "in general" for the reasons Arjan lists above.
> >So you absolutely have to limit any meta data io that would cause seeks,
> >and the file in question would have to be laid out in a closely
> >sequential fashion. As long as the access pattern generated by the app
> >asking for reservation is largely sequential, the kernel can do whatever
> >it needs to help you maintain the required bandwidth.
> >
> >On a per-file basis the bandwidth reservation should be doable, to the
> >extent that generic hardware allows.
>
> I agree - bandwidth allocation is really tricky to do in a useful way.
>
> On one hand, you could "time slice" the disk with some large quanta as
> we would do with a CPU to get some reasonably useful allocation for
> competing, streaming workloads.
>
> On the other hand, this kind of thing would kill latency if/when you hit
> any synchronous writes (or cold reads).

That's pretty close to the way that CFQ already operates. You need time
slices long enough to make the initial seek negligible, but short enough
to make the latencies nice. A tradeoff, of course.

> One other possible use for allocation is throttling a background
> workload (say, an iterative checker for a file system or some such
> thing) where the workload can run effectively forever, but should be
> contained to not interfere with foreground workloads. A similar time
> slice might be used to throttle this load down unless there is no
> competing work to be done.

That'd be the idle io class.
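For reference, the idle class is selectable per process via the ioprio_set() syscall (what ionice(1) uses). The 16-bit priority value packs the scheduling class above a 13-bit data field; the constants below match the kernel's include/linux/ioprio.h, though the syscall invocation itself is omitted:

```python
# ioprio value encoding used by ioprio_set()/ioprio_get(): the
# scheduling class lives in the top bits, the per-class priority
# level (0-7 for RT and BE) in the low bits.

IOPRIO_CLASS_SHIFT = 13
IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE = 1, 2, 3

def ioprio_value(io_class, data=0):
    """Build the priority value passed to ioprio_set()."""
    return (io_class << IOPRIO_CLASS_SHIFT) | data

idle = ioprio_value(IOPRIO_CLASS_IDLE)   # what "ionice -c3" requests
```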

--
Jens Axboe

Subject: RE: Bandwidth Allocations under CFQ I/O Scheduler

Jens Axboe wrote:
> Arjan van de Ven wrote:
> >
> > it's a nice idea in theory. However... since IO bandwidth for seeks is
> > about 1% to 3% of that of sequential IO (on disks at least), which
> > bandwidth do you want to allocate? "worst case" you need to use the
> > all-seeks bandwidth, but that's so far away from "best case" that it
> > may well not be relevant in practice. Yet there are real world cases
> > where for a period of time you approach worst case behavior ;(
>
> Bandwidth reservation would have to be confined to special cases, you
> obviously cannot do it "in general" for the reasons Arjan lists above.
> So you absolutely have to limit any meta data io that would cause seeks,
> and the file in question would have to be laid out in a closely
> sequential fashion. As long as the access pattern generated by the app
> asking for reservation is largely sequential, the kernel can do whatever
> it needs to help you maintain the required bandwidth.
>
> On a per-file basis the bandwidth reservation should be doable, to the
> extent that generic hardware allows.

I see bandwidth allocations coming in two flavors: floors and ceilings.
Floors (a guaranteed minimum) are indeed problematic because of the
danger of over-allocating bandwidth. Seek latency reducing your total
available bandwidth in non-deterministic ways only complicates the
issue.
Ceilings are easier, as we are simply capping utilization even when
excess capacity is available. Of course floors are probably what most
people are thinking of when they talk about allocations, but ceilings
have their place also. In an embedded environment where very
deterministic behavior is the goal, I/O ceilings could be useful. They
could also be useful for emulating legacy hardware performance, perhaps
for regression testing or some such (admittedly an edge case).

If you over-allocate bandwidth on a resource, the bandwidth allocation
would probably fall back to something more like the 'niceness' model
(with the higher bandwidth procs running with higher priority). The
only real change then is the enforcement of bandwidth ceilings. This
is probably not very useful in the general case (your main operating
system drives with many users/processes reading and writing), but it
can be very useful for managing the behavior of a limited set of apps
with exclusive access to a drive.
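The "ceiling" flavor maps naturally onto a token bucket: the process stays capped at its configured rate even when the disk is otherwise idle. A hypothetical sketch (class and parameter names are invented):

```python
# A token-bucket I/O ceiling: requests are admitted only while the
# bucket has tokens, so utilization is capped even with spare capacity.

class Ceiling:
    def __init__(self, rate_bps, burst):
        self.rate = rate_bps   # refill rate, bytes/sec
        self.tokens = burst    # current allowance, bytes
        self.burst = burst     # maximum allowance
        self.last = 0.0        # time of last accounting

    def allow(self, nbytes, now):
        """May a request of nbytes be dispatched at time `now` (sec)?"""
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False
```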

There is a body of knowledge in the ISP/routing world we can draw on
here, though they don't have the same latency issues.

Later,

Thad Phetteplace

2006-10-18 08:00:30

by Jakob Oestergaard

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Tue, Oct 17, 2006 at 03:23:13PM +0200, Jens Axboe wrote:
> On Tue, Oct 17 2006, Arjan van de Ven wrote:
...
> > Hi,
> >
> > it's a nice idea in theory. However... since IO bandwidth for seeks is
> > about 1% to 3% of that of sequential IO (on disks at least), which
> > bandwidth do you want to allocate? "worst case" you need to use the
> > all-seeks bandwidth, but that's so far away from "best case" that it may
> > well not be relevant in practice. Yet there are real world cases where
> > for a period of time you approach worst case behavior ;(
>
> Bandwidth reservation would have to be confined to special cases, you
> obviously cannot do it "in general" for the reasons Arjan lists above.

How about allocating I/O operations instead of bandwidth?

So, any read is really a seek+read, and we count that as one I/O
operation. Same for writes.

Since the total "capacity" of the system is typically (in real-world
scenarios) the number of operations (seek+X) rather than the raw
sequential bandwidth anyway, I suppose that I/O operations would be what
you wanted to allocate anyway.

Anyway, just a thought...

(And if you're thinking one sequential reader/writer could then starve
the system; well, count every 256KiB of data to read/write as a separate
I/O operation even though no seek is needed. That would very roughly
match the raw read/write performance with the seek performance)
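The accounting rule above (a seek+read is one op, and every 256KiB of sequential transfer also counts as an op) can be read as charging the larger of the two costs. A sketch, where that max-of-costs interpretation is my assumption:

```python
# One reading of the proposed accounting: charge a request the larger
# of its seek count and its transfer size in 256 KiB chunks, so seeky
# and sequential loads are measured in the same "op" currency.

CHUNK = 256 * 1024

def io_ops(nbytes, seeks):
    transfer_ops = -(-nbytes // CHUNK)   # ceil(nbytes / CHUNK)
    return max(seeks, transfer_ops)
```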

--

/ jakob

2006-10-18 09:41:08

by Arjan van de Ven

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, 2006-10-18 at 10:00 +0200, Jakob Oestergaard wrote:
> On Tue, Oct 17, 2006 at 03:23:13PM +0200, Jens Axboe wrote:
> > On Tue, Oct 17 2006, Arjan van de Ven wrote:
> ...
> > > Hi,
> > >
> > > it's a nice idea in theory. However... since IO bandwidth for seeks is
> > > about 1% to 3% of that of sequential IO (on disks at least), which
> > > bandwidth do you want to allocate? "worst case" you need to use the
> > > all-seeks bandwidth, but that's so far away from "best case" that it may
> > > well not be relevant in practice. Yet there are real world cases where
> > > for a period of time you approach worst case behavior ;(
> >
> > Bandwidth reservation would have to be confined to special cases, you
> > obviously cannot do it "in general" for the reasons Arjan lists above.
>
> How about allocating I/O operations instead of bandwidth ?
>
> So, any read is really a seek+read, and we count that as one I/O
> operation. Same for writes.

Hi,

I can see that that makes it simple, but.. what would it MEAN? E.g. what
would a system administrator use it for? It then no longer means "my mp3
player is guaranteed to get the streaming mp3 from the disk at this
bitrate" or something like that... so my question to you is: can you
describe what it'd bring the admin to put such an allocation in place?
If we find one, it can be a good approach.. but if not, I'm less certain
this'll be used..

Greetings,
Arjan van de Ven

2006-10-18 09:50:47

by Jens Axboe

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18 2006, Jakob Oestergaard wrote:
> On Tue, Oct 17, 2006 at 03:23:13PM +0200, Jens Axboe wrote:
> > On Tue, Oct 17 2006, Arjan van de Ven wrote:
> ...
> > > Hi,
> > >
> > > it's a nice idea in theory. However... since IO bandwidth for seeks is
> > > about 1% to 3% of that of sequential IO (on disks at least), which
> > > bandwidth do you want to allocate? "worst case" you need to use the
> > > all-seeks bandwidth, but that's so far away from "best case" that it may
> > > well not be relevant in practice. Yet there are real world cases where
> > > for a period of time you approach worst case behavior ;(
> >
> > Bandwidth reservation would have to be confined to special cases, you
> > obviously cannot do it "in general" for the reasons Arjan lists above.
>
> How about allocating I/O operations instead of bandwidth ?
>
> So, any read is really a seek+read, and we count that as one I/O
> operation. Same for writes.
>
> Since the total "capacity" of the system is typically (in real-world
> scenarios) the number of operations (seek+X) rather than the raw
> sequential bandwidth anyway, I suppose that I/O operations would be what
> you wanted to allocate anyway.
>
> Anyway, just a thought...

While that may make some sense internally, the exported interface would
never be workable like that. It needs to be simple, "give me foo kb/sec
with max latency bar for this file", with an access pattern or assumed
sequential io.

Nobody speaks of iops/sec except some silly benchmark programs. I know
that you are describing pseudo-iops, but it still doesn't make it more
clear. Things aren't that simple.

--
Jens Axboe

2006-10-18 11:04:06

by Helge Hafting

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

Jens Axboe wrote:
> While that may make some sense internally, the exported interface would
> never be workable like that. It needs to be simple, "give me foo kb/sec
> with max latency bar for this file", with an access pattern or assumed
> sequential io.
>
> Nobody speaks of iops/sec except some silly benchmark programs. I know
> that you are describing pseudo-iops, but it still doesn't make it more
> clear. Things aren't that simple.
>
How about "give me 10% of total io capacity?" People understand
this, and the io scheduler can then guarantee this by ensuring
that the process gets 1 out of 10 io requests as long as it
keeps submitting enough.

The admin can then set a reasonable percentage depending on
the machine's capacity.

Helge Hafting

2006-10-18 11:13:29

by Jens Axboe

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18 2006, Helge Hafting wrote:
> Jens Axboe wrote:
> >While that may make some sense internally, the exported interface would
> >never be workable like that. It needs to be simple, "give me foo kb/sec
> >with max latency bar for this file", with an access pattern or assumed
> >sequential io.
> >
> >Nobody speaks of iops/sec except some silly benchmark programs. I know
> >that you are describing pseudo-iops, but it still doesn't make it more
> >clear. Things aren't that simple.
> >
> How about "give me 10% of total io capacity?" People understand
> this, and the io scheduler can then guarantee this by ensuring
> that the process gets 1 out of 10 io requests as long as it
> keeps submitting enough.

The thing about disks is that it's not as easy as giving the process 10%
of the io requests issued. That only works if the load in question is
purely random io, and that's not very interesting.

You need to say 10% of the disk time, which is something CFQ can very
easily be modified to do since it works with time slices already. 10%
doesn't mean very much though, you need a timeframe for that to make
sense anyways. Give me 100msec every 1000msecs makes more sense.
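The "100msec every 1000msecs" formulation can be sketched as a per-window disk-time budget; this is an illustration of the idea, not CFQ code:

```python
# A windowed disk-time budget: within each window the reserved process
# may consume up to slice_ms of disk time; the rest of the window goes
# to everyone else. Numbers match the "100 ms per 1000 ms" example.

class SliceBudget:
    def __init__(self, slice_ms=100, window_ms=1000):
        self.slice_ms, self.window_ms = slice_ms, window_ms
        self.window, self.used = 0, 0

    def charge(self, now_ms, cost_ms):
        """Account cost_ms of disk time; True while within the budget."""
        w = now_ms // self.window_ms
        if w != self.window:           # entered a new window: reset
            self.window, self.used = w, 0
        if self.used + cost_ms > self.slice_ms:
            return False
        self.used += cost_ms
        return True
```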

--
Jens Axboe

2006-10-18 11:24:36

by Ric Wheeler

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler


Helge Hafting wrote:

> Jens Axboe wrote:
>
>> While that may make some sense internally, the exported interface would
>> never be workable like that. It needs to be simple, "give me foo kb/sec
>> with max latency bar for this file", with an access pattern or assumed
>> sequential io.
>>
>> Nobody speaks of iops/sec except some silly benchmark programs. I know
>> that you are describing pseudo-iops, but it still doesn't make it more
>> clear. Things aren't that simple.
>>
>
> How about "give me 10% of total io capacity?" People understand
> this, and the io scheduler can then guarantee this by ensuring
> that the process gets 1 out of 10 io requests as long as it
> keeps submitting enough.
>
> The admin can then set a reasonable percentage depending on
> the machine's capacity.
>
> Helge Hafting

The tricky part is that when you mix up workloads, you blow the drive's
ability to minimize head seek & rotational latency. For example, I have
measured almost a 10x decrease when I mix one serious workload (reading
each file in a large file system as fast as you can) with a moderate
write workload.

All a long winded way of saying that what we might be able to do in the
worst case is to give an even portion of that worst case IO capability
which is itself only 10% of the best case (i.e., 1% of the non-shared
best case) ;-)

Some of the high-end arrays (like the EMC Symmetrix, IBM Shark, Hitachi
boxes, etc) are much better at this sharing since they have massive
amounts of nonvolatile DRAM & lots of algorithmic ability to tease apart
individual streams internally. Note that they have to do this since
they are connected up to many different hosts.

It might be interesting to think about how we would tweak things for
this specific class of arrays as a special case.
ric

2006-10-18 11:30:01

by Jakob Oestergaard

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18, 2006 at 11:40:56AM +0200, Arjan van de Ven wrote:
...
> Hi,
>
> I can see that that makes it simple, but.. what would it MEAN? Eg what
> would a system administrator use it for?

For example, I could allocate "at least 100 iops/sec" for my database.
VMware can take whatever is left.

I have no idea how much bandwidth my database needs... But I have a
rough idea about how many I/O operations it does for a given operation.
And if I don't, strace can tell me pretty quick :)

> It then no longer means "my mp3
> player is guaranteed to get the streaming mp3 from the disk at this
> bitrate" or something like that...

In a sense you are right.

You cannot be certain that the mp3 player will get a specific bandwidth.
The mp3 player will be accessing the underlying storage through a
filesystem, which again means that accessing a file sequentially *will*
cause non-sequential I/O on the underlying device(s).

If you wanted to guarantee any specific bandwidth, you would somehow
assume that you had an infinite (or at least very very high) number of
seeks at your disposal. Or that seeks were free... In any other
scenario, the total "capacity" of your underlying storage, the maximum
amount of bandwidth (including non-free seeks) available, would vary
depending on how it is currently used (how many seeks are issued) by all
the clients.

So, what I'm arguing is; you will not want to specify a fixed sequential
bandwidth for your mp3 player.

What you want to do is this: Allocate 5 iops/sec for your mp3 player
because either a quick calculation - or - experience has shown that this
is enough for it to keep its buffer from depleting at all times.

Describing iops/sec for your mp3 player is at least as simple as
sequential bitrate. The difference is, that you can implement iops/sec
allocation whereas you cannot implement bitrate allocation (in a
meaningful way at least) :)
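For scale, the 5 iops/sec mp3 budget above is generous under assumed numbers (a 192 kbit/s stream and 128 KiB buffered reads; both figures invented for illustration):

```python
# Back-of-envelope for the mp3 player's iops/sec budget: a steady
# 192 kbit/s stream read in 128 KiB chunks needs well under one
# operation per second on average.

BITRATE = 192_000 / 8      # bytes/sec consumed by the decoder (assumed)
READ_SIZE = 128 * 1024     # bytes fetched per I/O operation (assumed)

ops_per_sec = BITRATE / READ_SIZE
print(f"{ops_per_sec:.2f} ops/sec")   # well under a 5 ops/sec budget
```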

> so my question to you is: can you
> describe what it'd bring the admin to put such an allocation in place?

Limiting on iops/sec rather than bandwidth is simply accepting that
bandwidth does not make sense (because you cannot know how much of it
you have and therefore you cannot slice up your total capacity), and
realizing that bandwidth, in the scenarios where limiting is interesting,
is in reality bound by seeks rather than sequential on-disk throughput.

> If we find that it can be a good approach.. but if not, I'm less certain
> this'll be used..

I can only see a problem with specifying iops/sec in the one scenario
where you have multiple sequential readers or writers, and you want to
distribute bandwidth between them.

However, in that scenario, where you have multiple clients, *seeks* will
again be your limiting factor.

Specifying iops/sec might be difficult for the admin. But I really can't
see how you would implement bandwidth limiting in a meaningful way - and
if you can't do that, then specifying limits in terms of a
bandwidth-limiting mechanism that doesn't work properly will be even
harder :)

The only situation in which seeks is not either the limiting factor, or
at least a very very large contributor to I/O wait, is in the situation
where you have only one client. And, if you have only one client, what
is it you need sharing of resources for again?

In all other scenarios, I believe iops/sec is by far a superior way of
describing the resource allocation. For two reasons:
1) It describes what the hardware provides
2) By describing a concept based on the real world, it may actually be
possible to implement so that it works as intended


I hope some of the above makes sense. I've tried to explain what I mean
to the best of my ability :)

--

/ jakob

2006-10-18 11:48:36

by Jens Axboe

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18 2006, Jakob Oestergaard wrote:
> On Wed, Oct 18, 2006 at 11:40:56AM +0200, Arjan van de Ven wrote:
> ...
> > Hi,
> >
> > I can see that that makes it simple, but.. what would it MEAN? Eg what
> > would a system administrator use it for?
>
> For example, I could allocate "at least 100 iops/sec" for my database.
> The VMWare can take whatever is left.
>
> I have no idea how much bandwidth my database needs... But I have a
> rough idea about how many I/O operations it does for a given operation.
> And if I don't, strace can tell me pretty quick :)

That's crazy. So you want a user of this to strace and write a script
parsing strace output to tell you possibly how many iops/sec you need?

> > It then no longer means "my mp3
> > player is guaranteed to get the streaming mp3 from the disk at this
> > bitrate" or something like that...
>
> In a sense you are right.
>
> You cannot be certain that the mp3 player will get a specific bandwidth.
> The mp3 player will be accessing the underlying storage through a
> filesystem, which again means that accessing a file sequentially *will*
> cause non-sequential I/O on the underlying device(s).
>
> If you wanted to guarantee any specific bandwidth, you would somehow
> assume that you had an infinite (or at least very very high) number of
> seeks at your disposal. Or that seeks were free... In any other
> scenario, the total "capacity" of your underlying storage, the maximum
> amount of bandwidth (including non-free seeks) available, would vary
> depending on how it is currently used (how many seeks are issued) by all
> the clients.
>
> So, what I'm arguing is; you will not want to specify a fixed sequential
> bandwidth for your mp3 player.
>
> What you want to do is this: Allocate 5 iops/sec for your mp3 player
> because either a quick calculation - or - experience has shown that this
> is enough for it to keep its buffer from depleting at all times.

But that is the only number that makes sense. To give some sort of soft
QOS for bandwidth, you need the file given so the kernel can bring in
the meta data (to avoid those seeks) and see how the file is laid out.
For the mp3 case, you should not even need to ask the user anything. The
player app knows exactly how much bandwidth it needs and what kind of
latency; it can tell from the bitrate of the media. What you are arguing
for is doing trial and error with a magic iops/sec metric that is both
hard to understand and impossible to quantify.

> Describing iops/sec for your mp3 player is at least as simple as
> sequential bitrate. The difference is, that you can implement iops/sec
> allocation whereas you cannot implement bitrate allocation (in a
> meaningful way at least) :)
>
>
> > so my question to you is: can you
> > describe what it'd bring the admin to put such an allocation in place?
>
> Limiting on iops/sec rather than bandwidth, is simply accepting that
> bandwidth does not make sense (because you cannot know how much of it
> you have and therefore you cannot slice up your total capacity), and,
> realizing that bandwidth in the scenarios where limiting is interesting
> is in reality bound by seeks rather than sequential on-disk throughput.

I don't understand your arguments, to be honest. If you can tell the
iops/sec rate for a given workload, you can certainly see the bandwidth
as well. Both iops/sec and bandwidth will vary wildly depending on the
workload(s) on the disk.

> > If we find that it can be a good approach.. but if not, I'm less certain
> > this'll be used..
>
> I can only see a problem with specifying iops/sec in the one scenario
> where you have multiple sequential readers or writers, and you want to
> distribute bandwidth between them.

If you only have one app doing io, you don't need QOS. The thing is, you
always have competing apps. Even with only one user space app running,
the kernel may still generate io for you.

> In all other scenarios, I believe iops/sec is by far a superior way of
> describing the resource allocation. For two reasons:
> 1) It describes what the hardware provides
> 2) By describing a concept based on the real world, it may actually be
> possible to implement so that it works as intended

Same arguments. You can't universally state that this disk gives you
80MiB/sec, and you can't universally state that this disk gives you 1000
iops/sec. You need to also define the conditions for when it can provide
this performance. So you instead say: this disk does 80MiB/sec if read
with at least 8KiB blocks from lba 0 to 50000 sequentially. Or you can
state the same with iops/sec.

--
Jens Axboe

2006-10-18 12:23:24

by Jakob Oestergaard

Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18, 2006 at 01:49:14PM +0200, Jens Axboe wrote:
> On Wed, Oct 18 2006, Jakob Oestergaard wrote:
...
> > I have no idea how much bandwidth my database needs... But I have a
> > rough idea about how many I/O operations it does for a given operation.
> > And if I don't, strace can tell me pretty quick :)
>
> That's crazy. So you want a user of this to strace and write a script
> parsing strace output to tell you possibly how many iops/sec you need?

Come up with something better then, genius :)

strace for iops is doable albeit complicated.

Determining MiB/sec requirement for sufficient db performance is
impossible.

> >
> > So, what I'm arguing is; you will not want to specify a fixed sequential
> > bandwidth for your mp3 player.
> >
> > What you want to do is this: Allocate 5 iops/sec for your mp3 player
> > because either a quick calculation - or - experience has shown that this
> > is enough for it to keep its buffer from depleting at all times.
>
> But that is the only number that makes sense. To give some sort of soft
> QOS for bandwidth, you need the file given so the kernel can bring in
> the meta data (to avoid those seeks) and see how the file is laid out.

Ok I see where you're going. I think it sounds very complicated - for
the user and for the kernel.

Would you want to limit bandwidth on a per-file or per-process basis?
You're talking files, above, I was thinking about processes (consumers
if you like) the whole time.

Have you thought about how this would work in the long run, with many
files coming into use? The kernel can't have the meta-data cached for
all files - so the reading-in of metadata would affect the remaining
available disk performance...

> For the mp3 case, you should not even need to ask the user anything. The
> player app knows exactly how much bandwidth it needs and what kind of
> latency; it can tell from the bitrate of the media.

Agreed. And this holds true for both base metrics, bandwidth or iops/sec.

> What you are arguing
> for is doing trial and error

Sort-of correct.

> with a magic iops/sec metric that is both
> hard to understand and impossible to quantify.

iops/sec is what you get from your disks. In real world scenarios. It's
no more magic than the real world, and no harder to understand than real
world disks. Although I admit real-world disks can be a bitch at times ;)

My argument is that it is simpler to understand than bandwidth.

Sure, for the streaming file example bandwidth sounds simple. But how
many real-world applications are like that? What about databases? What
about web servers? What about mail servers? What about 99% of the
real-world applications out there that are not streaming audio or video
players?

> > Limiting on iops/sec rather than bandwidth, is simply accepting that
> > bandwidth does not make sense (because you cannot know how much of it
> > you have and therefore you cannot slice up your total capacity), and,
> > realizing that bandwidth in the scenarios where limiting is interesting
> > is in reality bound by seeks rather than sequential on-disk throughput.
>
> I don't understand your arguments, to be honest. If you can tell the
> iops/sec rate for a given workload, you can certainly see the bandwidth
> as well.

My thesis is that for most applications it is not the bandwidth you
care about.

If I am not right in this, sure, you have a point then. But hey, how
many of the applications out there are mp3 players? (in other words;
please oh please, prove me wrong, I like it :)

> Both iops/sec and bandwidth will vary wildly depending on the
> workload(s) on the disk.

The total iops/sec "available" from a given disk will not vary a lot,
compared to how the total bandwidth available from a given disk will
vary.

...
> > I can only see a problem with specifying iops/sec in the one scenario
> > where you have multiple sequential readers or writers, and you want to
> > distribute bandwidth between them.
>
> If you only have one app doing io, you don't need QOS.

Precisely!

In the *one* case where it is actually possible to implement a QOS
system based on bandwidth, you don't need QOS.

With more than 1 client, you get seeks, and then bandwidth is no longer
a sensible measure.

> The thing is, you
> always have competing apps. Even with only one user space app running,
> the kernel may still generate io for you.

Sing it brother, sing it! ;)

> > In all other scenarios, I believe iops/sec is by far a superior way of
> > describing the resource allocation. For two reasons:
> > 1) It describes what the hardware provides
> > 2) By describing a concept based on the real world it may actually be
> > possible to implement so that it works as intended
>
> Same arguments. You can't universally state that this disk gives you
> 80MiB/sec, and you can't universally state that this disk gives you 1000
> iops/sec.

I agree.

But I would be lying a lot less if I made the claim in iops/sec :)

Iops/sec will vary by a factor of two or three, depending on the nature
of the operations.

Bandwidth will vary three to five orders of magnitude depending on the
nature of the I/O operations issued to the device.

> You need to also define the conditions for when it can provide
> this performance. So if you instead say this disk does 80MiB/sec if read
> with at least 8KiB blocks from lba 0 to 50000 sequentially. Or you can
> state the same with iops/sec.

Yep.

However, for the interface to be useful, it needs two things as I see it
(and I may well be overlooking something):
1) It needs to be simple to use
2) It needs to do what it claims, "well enough"



--

/ jakob

2006-10-18 12:40:41

by Alan

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

Ar Mer, 2006-10-18 am 14:23 +0200, ysgrifennodd Jakob Oestergaard:
> iops/sec is what you get from your disks. In real world scenarios. It's
> no more magic than the real world, and no harder to understand than real
> world disks. Although I admit real-world disks can be a bitch at times ;)

Even iops/sec is very vague and arbitrary. If your disk happens to be
retrying a sector or doing a cleaning pass or any other housekeeping or
vibration damping and so on you'll get very different numbers.

Bandwidth is completely silly in this context, iops/sec is merely
hopeless 8)

2006-10-18 12:42:15

by Jens Axboe

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18 2006, Jakob Oestergaard wrote:
> On Wed, Oct 18, 2006 at 01:49:14PM +0200, Jens Axboe wrote:
> > On Wed, Oct 18 2006, Jakob Oestergaard wrote:
> ...
> > > I have no idea how much bandwidth my database needs... But I have a
> > > rough idea about how many I/O operations it does for a given operation.
> > > And if I don't, strace can tell me pretty quick :)
> >
> > That's crazy. So you want a user of this to strace and write a script
> > parsing strace output to tell you possibly how many iops/sec you need?
>
> Come up with something better then, genius :)
>
> strace for iops is doable albeit complicated.

The concept was already described, bandwidth.

> Determining MiB/sec requirement for sufficient db performance is
> impossible.

But you can say you want to give the db 90% of the disk bandwidth, and
at least 50%. The iops/sec metric doesn't help you.

It's an entirely different thing from the mp3 player. With the player app,
you want to have the bitrate available at the right latency. For a db, I
guess you typically want to contain it somehow - make sure it gets at
least foo amount of the disk, but don't let it suck everything.

> > > So, what I'm arguing is; you will not want to specify a fixed sequential
> > > bandwidth for your mp3 player.
> > >
> > > What you want to do is this: Allocate 5 iops/sec for your mp3 player
> > > because either a quick calculation - or - experience has shown that this
> > > is enough for it to keep its buffer from depleting at all times.
> >
> > But that is the only number that makes sense. To give some sort of soft
> > QOS for bandwidth, you need the file given so the kernel can bring in
> > the meta data (to avoid those seeks) and see how the file is laid out.
>
> Ok I see where you're going. I think it sounds very complicated - for
> the user and for the kernel.
>
> Would you want to limit bandwidth on a per-file or per-process basis?
> You're talking files, above, I was thinking about processes (consumers
> if you like) the whole time.

You need to define your workload for the kernel to know what to do. So
for the bandwidth case, you need to tell the kernel against what file
you want to allocate that bandwidth. If you go the percentage route, you
don't need that. The percentage route doesn't care about sequential or
random io, it just gets you foo % of the disk time. If the slice given
is large enough, with 10% of the disk time you may have 90% of the total
bandwidth if the remaining 90% of the time is spent doing random io. But
you still have 10% of the time allocated.
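
The "10% of the time can be 90% of the bytes" point can be made concrete
with a toy calculation (the disk numbers below are assumptions for
illustration, not measurements of any real device):

```python
# Assumed disk characteristics, for illustration only.
SEQ_RATE_MIB = 80.0    # sequential streaming rate, MiB/s
SEEKS_PER_SEC = 160    # random requests/s the disk sustains while seeking
RANDOM_IO_KIB = 4      # request size of the competing random workload

def bandwidth_split(seq_time_share):
    """Bytes moved per second when a sequential reader gets
    `seq_time_share` of the disk *time* and a purely random
    workload gets the remainder."""
    seq_bw = seq_time_share * SEQ_RATE_MIB
    rand_bw = (1.0 - seq_time_share) * SEEKS_PER_SEC * RANDOM_IO_KIB / 1024.0
    return seq_bw, rand_bw

seq_bw, rand_bw = bandwidth_split(0.10)  # sequential reader gets 10% of the time
share = seq_bw / (seq_bw + rand_bw)
print(f"10% of the time -> {seq_bw:.1f} MiB/s, {share:.0%} of all bytes moved")
# With these numbers, 10% of the disk time is roughly 93% of the total bandwidth.
```

Which is exactly why a time share is enforceable while a bandwidth share
floats with whatever the other consumers happen to be doing.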

> Have you thought about how this would work in the long run, with many
> files coming into use? The kernel can't have the meta-data cached for
> all files - so the reading-in of metadata would affect the remaining
> available disk performance...

Just like any other system activity affects the disk bandwidth. That's
exactly one of the reasons why you want to operate in terms of time, not
requests.

> > For the mp3 case, you should not even need to ask the user anything. The
> > player app knows exactly how much bandwidth it needs and what kind of
> > latency; it can tell from the bitrate of the media.
>
> Agreed. And this holds true for both base metrics, bandwidth or iops/sec.

Right, because they are sides of the same story. The difference is not
in the metric, but the meaning it gives to the user.

> > What you are arguing
> > for is doing trial and error
>
> Sort-of correct.

How would you otherwise do it?

> > with a magic iops/sec metric that is both
> > hard to understand and impossible to quantify.
>
> iops/sec is what you get from your disks. In real world scenarios. It's
> no more magic than the real world, and no harder to understand than real
> world disks. Although I admit real-world disks can be a bitch at times ;)

Again, iops/sec doesn't make sense unless you say how big the iops are
and what your stream of iops looks like. That's why I say it's a
benchmark metric.

> My argument is that it is simpler to understand than bandwidth.

And mine is that that is nonsense :-)

> Sure, for the streaming file example bandwidth sounds simple. But how
> many real-world applications are like that? What about databases? What
> about web servers? What about mail servers? What about 99% of the
> real-world applications out there that are not streaming audio or video
> players?

Reserving bandwidth at x kib/sec for an mp3 player and containing a
different type of app are two separate things. A decent io scheduler
should make sure in general that nobody is totally starved. If you have
5 services running on your machine and you want to make sure that eg the
web server gets 50% of the bandwidth, you will want to inform the kernel
of that fact. Since you don't know what the throughput of the disk is at
any given time (be it MiB/sec or iops/sec, doesn't matter), you can only
say 50% at that time.

I really don't see how this pertains to bandwidth vs iops/sec.

> > > Limiting on iops/sec rather than bandwidth, is simply accepting that
> > > bandwidth does not make sense (because you cannot know how much of it
> > > you have and therefore you cannot slice up your total capacity), and,
> > > realizing that bandwidth in the scenarios where limiting is interesting
> > > is in reality bound by seeks rather than sequential on-disk throughput.
> >
> > I don't understand your arguments, to be honest. If you can tell the
> > iops/sec rate for a given workload, you can certainly see the bandwidth
> > as well.
>
> My thesis is that for most applications it is not the bandwidth you
> care about.
>
> If I am not right in this, sure, you have a point then. But hey, how
> many of the applications out there are mp3 players? (in other words;
> please oh please, prove me wrong, I like it :)

We are talking about two separate things here. The mp3 player vs some
other app argument is totally separate from iops/sec vs MiB/sec.

> > Both iops/sec and bandwidth will vary wildly depending on the
> > workload(s) on the disk.
>
> The total iops/sec "available" from a given disk will not vary a lot,
> compared to how the total bandwidth available from a given disk will
> vary.

That's only true if you scale your iops. And how are you going to give
that number? You need to define what an iop is for it to be meaningful.

> > > I can only see a problem with specifying iops/sec in the one scenario
> > > where you have multiple sequential readers or writers, and you want to
> > > distribute bandwidth between them.
> >
> > If you only have one app doing io, you don't need QOS.
>
> Precisely!
>
> In the *one* case where it is actually possible to implement a QOS
> system based on bandwidth, you don't need QOS.
>
> With more than 1 client, you get seeks, and then bandwidth is no longer
> a sensible measure.

And neither is iops/sec. But things don't deteriorate that quickly, if
you can tolerate higher latency, it's quite possible to have most of the
potential bandwidth available for > 1 client workloads.

--
Jens Axboe

2006-10-18 12:43:42

by Jens Axboe

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18 2006, Alan Cox wrote:
> Bandwidth is completely silly in this context, iops/sec is merely
> hopeless 8)

Both need the disk to play nicely, if you get into error handling or
correction, you get screwed. Bandwidth by itself is meaningless, you
need latency as well to make sense of it.

--
Jens Axboe

2006-10-18 12:44:50

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18, 2006 at 01:42:24PM +0100, Alan Cox wrote:
> Ar Mer, 2006-10-18 am 14:23 +0200, ysgrifennodd Jakob Oestergaard:
> > iops/sec is what you get from your disks. In real world scenarios. It's
> > no more magic than the real world, and no harder to understand than real
> > world disks. Although I admit real-world disks can be a bitch at times ;)
>
> Even iops/sec is very vague and arbitrary. If your disk happens to be
> retrying a sector or doing a cleaning pass or any other housekeeping or
> vibration damping and so on you'll get very different numbers.

True.

>
> Bandwidth is completely silly in this context, iops/sec is merely
> hopeless 8)

Thanks Alan - I feel much better now :)

--

/ jakob

2006-10-18 12:55:58

by Nick Piggin

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

Jens Axboe wrote:
> On Wed, Oct 18 2006, Alan Cox wrote:
>
>>Bandwidth is completely silly in this context, iops/sec is merely
>>hopeless 8)
>
>
> Both need the disk to play nicely, if you get into error handling or
> correction, you get screwed. Bandwidth by itself is meaningless, you
> need latency as well to make sense of it.

When writing an IO scheduler, I decided `time' was a pretty good
metric. That's roughly what we use for CPU scheduling as well (but
use nice levels adjusted by dynamic priorities instead of a
straight % share).

So you could say you want your database to consume no more than 50%
of disk and have your mp3 player get a minimum of 10%. Of course,
that doesn't say anything about what the time slices are, or what
latencies you can expect (1s out of every 10, or 100ms out of every
1000?).

It is still far from perfect, but at least it accounts for seeks vs
throughput reasonably well, and in a device independent manner.

--
SUSE Labs, Novell Inc.

2006-10-18 13:04:19

by Jens Axboe

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18 2006, Nick Piggin wrote:
> Jens Axboe wrote:
> >On Wed, Oct 18 2006, Alan Cox wrote:
> >
> >>Bandwidth is completely silly in this context, iops/sec is merely
> >>hopeless 8)
> >
> >
> >Both need the disk to play nicely, if you get into error handling or
> >correction, you get screwed. Bandwidth by itself is meaningless, you
> >need latency as well to make sense of it.
>
> When writing an IO scheduler, I decided `time' was a pretty good
> metric. That's roughly what we use for CPU scheduling as well (but
> use nice levels adjusted by dynamic priorities instead of a
> straight % share).

Precisely, hence CFQ is now based on the time metric. Given larger
slices, you can mostly eliminate the impact of other applications in the
system.

> So you could say you want your database to consume no more than 50%
> of disk and have your mp3 player get a minimum of 10%. Of course,
> that doesn't say anything about what the time slices are, or what
> latencies you can expect (1s out of every 10, or 100ms out of every
> 1000?).

As I wrote previously, both a percentage and bandwidth along with
desired latency make sense. For the mp3 player, you probably don't care
how much of the system it uses. You want 256kbit/sec or whatever you
media is, and if you don't get that then things don't work. The other
scenario is limiting eg the database.

> It is still far from perfect, but at least it accounts for seeks vs
> throughput reasonably well, and in a device independent manner.

We can't aim for perfection, as that is simply not doable generically.
So we have to settle for something that makes sense and is enforceable to
some extent. We can limit something to foo percentage of the disk, and
we can try as hard as possible to satisfy the mp3 player as long as we
know what it requires. Right now we don't, so we treat everybody the
same wrt slices and latency.

--
Jens Axboe

2006-10-18 13:35:30

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18, 2006 at 02:42:53PM +0200, Jens Axboe wrote:
...
> > impossible.
>
> But you can say you want to give the db 90% of the disk bandwidth, and
> at least 50%. The iops/sec metric doesn't help you.

I think we're misunderstanding each other...

I am trying to say that being able to specify "90% of the disk
bandwidth" does not help me.

Because the DB would probably be happy with just 1% of the 100MiB/sec
theoretical bandwidth I could get from sequentially reading the disk -
but if it needs, say, 160 seeks per second to get that 1% of
100MiB/sec, then with a 6ms seek time those seeks alone eat 96% of the
disk time available.
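
The arithmetic behind that, spelled out (the numbers are the ones quoted
above):

```python
SEQ_BW_MIB = 100.0   # theoretical sequential bandwidth, MiB/s
SEEK_MS = 6.0        # average seek time, ms
SEEKS_PER_SEC = 160  # random requests/s the DB needs

db_bw = 0.01 * SEQ_BW_MIB                        # the DB's 1% -> 1 MiB/s
time_seeking = SEEKS_PER_SEC * SEEK_MS / 1000.0  # fraction of each second spent seeking
print(f"{db_bw:.0f} MiB/s of random io costs {time_seeking:.0%} of the disk time")
# 160 seeks/s * 6 ms = 0.96 -> 96% of the disk time for 1% of the bandwidth
```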

So, I believe we need something that takes into account the general
performance of the disk - not just the single-user-sequential-read/write
bandwidth. And, as I shall soon argue, this is where I do think the
iops/sec metric does help - I probably just explained it very poorly to
begin with.

> >
> > Would you want to limit bandwidth on a per-file or per-process basis?
> > You're talking files, above, I was thinking about processes (consumers
> > if you like) the whole time.
>
> You need to define your workload for the kernel to know what to do. So
> for the bandwidth case, you need to tell the kernel against what file
> you want to allocate that bandwidth. If you go the percentage route, you
> don't need that. The percentage route doesn't care about sequential or
> random io, it just gets you foo % of the disk time. If the slice given
> is large enough, with 10% of the disk time you may have 90% of the total
> bandwidth if the remaining 90% of the time is spent doing random io. But
> you still have 10% of the time allocated.

I like the time allocation for several reasons:
1) It's presumably simple to implement
2) It will suit both your mp3 player and my database reasonably well
3) It's intuitive to the user - you can understand wall-clock time a lot
easier than all the little things that influence whether or not you
get a number of bytes written in a number of places on the disk in
more or less than the time you had available...

I think "reasonably well" is good enough for a kernel that isn't
hard-real-time anyway :)


...
[snip - good arguments, response will follow]
...

> > > with a magic iops/sec metric that is both
> > > hard to understand and impossible to quantify.
> >
> > iops/sec is what you get from your disks. In real world scenarios. It's
> > no more magic than the real world, and no harder to understand than real
> > world disks. Although I admit real-world disks can be a bitch at times ;)
>
> Again, iops/sec doesn't make sense unless you say how big the iops are

1 OSIOP (oestergaard standard input/output operation) is hereby defined
to be:
1 optional seek
plus
1 (read or write) of no more than 256 KiB (*)

(*): The size limit should be adjusted every 10 years as disk technology
evolves.

There you have it :)

So, a single 1MiB read on a disk is 4 OSIOPs, for example.
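
The counting rule is trivial to write down (a sketch; the function name
is mine, not anything defined elsewhere):

```python
OSIOP_MAX_KIB = 256  # size cap from the definition above

def osiops(size_kib):
    """How many OSIOPs a single contiguous read or write of `size_kib`
    KiB counts as: it is chopped into pieces of at most 256 KiB, each
    piece (with its optional seek) being one OSIOP."""
    return max(1, -(-size_kib // OSIOP_MAX_KIB))  # ceiling division

assert osiops(1024) == 4  # the 1 MiB example: four 256 KiB pieces
assert osiops(4) == 1     # a small random read is one OSIOP
```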

> and what your stream of iops looks like. That's why I say it's a
> benchmark metric.

I state that the total OSIOPs/second you can get out of a given disk
will not change by much, no matter which disk operations you perform and
how you mix them.

That was the whole point of using OSIOPs/sec rather than bandwidth to
begin with.

I know I did not properly define the iop to begin with - my bad, sorry.

>
> > My argument is that it is simpler to understand than bandwidth.
>
> And mine is that that is nonsense :-)

Still? :)

I hope the above clears up some of the misunderstandings.

...
...
> > The total iops/sec "available" from a given disk will not vary a lot,
> > compared to how the total bandwidth available from a given disk will
> > vary.
>
> That's only true if you scale your iops. And how are you going to give
> that number? You need to define what an iop is for it to be meaningful.

Done :)

A basic OSIOP is useful for the application, because it maps very
closely to the read/write/seek API that applications are built over.
Thus, the application will know very well how many OSIOPs it needs in
order to complete a given job.

The total number of OSIOPs/sec available in the system, however, will
vary depending on the characteristics of the disk subsystem. Just like
available cycles/sec vary with the speed of your processor.

You are correct in that the total number of OSIOPs/sec will not be
strictly constant over time - it will depend *somewhat* on the nature of
the operations performed. But it will not change completely - or at
least this is what I claim :)

...
> > With more than 1 client, you get seeks, and then bandwidth is no longer
> > a sensible measure.
>
> And neither is iops/sec.

We agree that neither is "correct".

I still claim that one is "not strictly correct but probably close
enough to be useful".

> But things don't deteriorate that quickly, if
> you can tolerate higher latency, it's quite possible to have most of the
> potential bandwidth available for > 1 client workloads.

True.

I do wonder, though, how often that would be practically useful. Seek
times are *huge* (milliseconds) compared to almost anything else we work
with.


--

/ jakob

2006-10-18 13:37:40

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18, 2006 at 10:55:55PM +1000, Nick Piggin wrote:
...
> So you could say you want your database to consume no more than 50%
> of disk and have your mp3 player get a minimum of 10%. Of course,
> that doesn't say anything about what the time slices are, or what
> latencies you can expect (1s out of every 10, or 100ms out of every
> 1000?).
>
> It is still far from perfect, but at least it accounts for seeks vs
> throughput reasonably well, and in a device independent manner.

Yup - it makes sense.

It would make very good sense (to me at least) if you can say "give me
at least 100msec every 1sec", as was already suggested. That would take
care of the latency problem too.

--

/ jakob

2006-10-18 13:39:32

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18, 2006 at 03:04:57PM +0200, Jens Axboe wrote:
...
> > So you could say you want your database to consume no more than 50%
> > of disk and have your mp3 player get a minimum of 10%. Of course,
> > that doesn't say anything about what the time slices are, or what
> > latencies you can expect (1s out of every 10, or 100ms out of every
> > 1000?).
>
> As I wrote previously, both a percentage and bandwidth along with
> desired latency make sense.

The fundamental problem I see, is, that while you can easily measure and
limit the bandwidth, you cannot really make bandwidth guarantees.

All the rest sounds great in my ears :)

--

/ jakob

2006-10-18 13:51:09

by Paulo Marques

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

Jens Axboe wrote:
>[...]
> Precisely, hence CFQ is now based on the time metric. Given larger
> slices, you can mostly eliminate the impact of other applications in the
> system.

Just one thought: we can't predict reliably how much time a request will
take to be serviced, but we can account the time it _took_ to service a
request.

If we account the time it took to service requests for each process, and
we have several processes with requests pending, we can use the same
algorithm we would use for a large time slice algorithm to select the
process to service.

This should make it as fair over time as a large time slice algorithm
and doesn't need large time slices, so latencies can be kept as low as
required.

However, having a small time slice will probably help the hardware
coalesce several requests from the same process that are more likely to
be to nearby sectors, and thus improve performance.

I'm leaving out the details, like we should find a way to make the
"fairness" work over a time window and not over the entire process
lifespan, maybe by using a sliding window over the last N seconds of
serviced requests to do the accounting or something.
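
A minimal sketch of that accounting idea (all names are mine; this is
the concept, not any kernel code):

```python
from collections import defaultdict, deque

class TimeFairScheduler:
    """Account the time each request *took* per process, keep only the
    last `window` seconds of history, and always dispatch for the
    process that has consumed the least disk time in that window."""

    def __init__(self, window=5.0):
        self.window = window
        # pid -> deque of (completion_time, service_time) pairs
        self.history = defaultdict(deque)

    def account(self, pid, now, service_time):
        """Record that a request for `pid` finished at `now` after
        occupying the disk for `service_time` seconds."""
        self.history[pid].append((now, service_time))

    def consumed(self, pid, now):
        """Disk time `pid` has used within the sliding window."""
        h = self.history[pid]
        while h and h[0][0] < now - self.window:
            h.popleft()  # expire entries older than the window
        return sum(t for _, t in h)

    def pick(self, pending, now):
        """Among pids with requests pending, pick the least-serviced."""
        return min(pending, key=lambda pid: self.consumed(pid, now))

sched = TimeFairScheduler(window=5.0)
sched.account(1, 0.0, 0.010)  # pid 1 used 10 ms of disk time
sched.account(2, 0.0, 0.002)  # pid 2 used only 2 ms
print(sched.pick([1, 2], now=1.0))  # pid 2 is behind, so it goes next
```

Note this only orders processes that already have requests queued; as
the reply below points out, it does nothing by itself for processes
doing sync or dependent io that trickles in over time.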

--
Paulo Marques - http://www.grupopie.com

"The face of a child can say it all, especially the
mouth part of the face."

2006-10-19 12:21:19

by Jens Axboe

[permalink] [raw]
Subject: Re: Bandwidth Allocations under CFQ I/O Scheduler

On Wed, Oct 18 2006, Paulo Marques wrote:
> Jens Axboe wrote:
> >[...]
> >Precisely, hence CFQ is now based on the time metric. Given larger
> >slices, you can mostly eliminate the impact of other applications in the
> >system.
>
> Just one thought: we can't predict reliably how much time a request will
> take to be serviced, but we can account the time it _took_ to service a
> request.
>
> If we account the time it took to service requests for each process, and
> we have several processes with requests pending, we can use the same
> selection algorithm a large-time-slice scheduler would use to pick the
> next process to service.
>
> This should make it as fair over time as a large time slice algorithm
> and doesn't need large time slices, so latencies can be kept as low as
> required.

Two problems:

- You can't chop things down to single request times. The cost of a
request varies greatly depending on what preceded it, hence you need
to account batches of requests from a process - this is what the time
slice currently accomplishes.

- Whether a process has requests pending or not varies a lot. The
typical bandwidth problem is due to processes doing sync or dependent
io where you only get io in pieces over time.

A request based approach only works over processes that always (or
almost always) have work left to do. You absolutely need the time slice
or some other waiting mechanism to help those that don't.

> However, having a small time slice will probably help the hardware
> coalesce several requests from the same process that are more likely to
> be to nearby sectors, and thus improve performance.

Either the process is submitting larger amounts of io and you'll get
the merging anyways, or it isn't. There's a large difference in time
scales here.

--
Jens Axboe