LinuxLists.cc - Re: [RFC] IO scheduler based IO controller V9

2009-09-10 15:19:24

Subject: Re: [RFC] IO scheduler based IO controller V9

Vivek Goyal wrote:
> Hi All,
>
> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.

Hi Vivek,

I've run some postgresql benchmarks for io-controller. Tests have been
made with 2.6.31-rc6 kernel, without io-controller patches (when
relevant) and with io-controller v8 and v9 patches.
I set up two instances of the TPC-H database, each running in their
own io-cgroup. I ran two clients to these databases and tested on each
that simple request:
$ select count(*) from LINEITEM;
where LINEITEM is the biggest table of TPC-H (6001215 entries,
720MB). That request generates a steady stream of IOs.

Time is measure by psql (\timing switched on). Each test is run twice
or more if there is any significant difference between the first two
runs. Before each run, the cache is flush:
$ echo 3 > /proc/sys/vm/drop_caches

Results with 2 groups of same io policy (BE) and same io weight (1000):

w/o io-scheduler io-scheduler v8 io-scheduler v9
first second first second first second
DB DB DB DB DB DB

CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s

As you can see, there is no significant difference for CFQ
scheduler. There is big improvement for noop and deadline schedulers
(why is that happening?). The performance with anticipatory scheduler
is a bit lower (~4%).

Results with 2 groups of same io policy (BE), different io weights and
CFQ scheduler:
io-scheduler v8 io-scheduler v9
weights = 1000, 500 35.6s 46.7s 35.6s 46.7s
weigths = 1000, 250 29.2s 45.8s 29.2s 45.6s

The result in term of fairness is close to what we can expect from the
ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
the first request get 2/3 (4/5) of io time as long as it runs and thus
finish in about 3/4 (5/8) of total time.

Results with 2 groups of different io policies, same io weight and
CFQ scheduler:
io-scheduler v8 io-scheduler v9
policy = RT, BE 22.5s 45.3s 22.4s 45.0s
policy = BE, IDLE 22.6s 44.8s 22.4s 45.0s

Here again, the result in term of fairness is very close from what we
expect.

Thanks,
Jerome

2009-09-10 20:54:06

by Vivek Goyal

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > Hi All,
> >
> > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>
> Hi Vivek,
>
> I've run some postgresql benchmarks for io-controller. Tests have been
> made with 2.6.31-rc6 kernel, without io-controller patches (when
> relevant) and with io-controller v8 and v9 patches.
> I set up two instances of the TPC-H database, each running in their
> own io-cgroup. I ran two clients to these databases and tested on each
> that simple request:
> $ select count(*) from LINEITEM;
> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> 720MB). That request generates a steady stream of IOs.
>
> Time is measure by psql (\timing switched on). Each test is run twice
> or more if there is any significant difference between the first two
> runs. Before each run, the cache is flush:
> $ echo 3 > /proc/sys/vm/drop_caches
>
>
> Results with 2 groups of same io policy (BE) and same io weight (1000):
>
> w/o io-scheduler io-scheduler v8 io-scheduler v9
> first second first second first second
> DB DB DB DB DB DB
>
> CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
> Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
> AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
> Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
>
> As you can see, there is no significant difference for CFQ
> scheduler.

Thanks Jerome.

> There is big improvement for noop and deadline schedulers
> (why is that happening?).

I think because now related IO is in a single queue and it gets to run
for 100ms or so (like CFQ). So previously, IO from both the instances
will go into a single queue which should lead to more seeks as requests
from two groups will kind of get interleaved.

With io controller, both groups have separate queues so requests from
both the data based instances will not get interleaved (This almost
becomes like CFQ where ther are separate queues for each io context
and for sequential reader, one io context gets to run nicely for certain
ms based on its priority).

> The performance with anticipatory scheduler
> is a bit lower (~4%).
>

I will run some tests with AS and see if I can reproduce this lower
performance and attribute it to a particular piece of code.

>
> Results with 2 groups of same io policy (BE), different io weights and
> CFQ scheduler:
> io-scheduler v8 io-scheduler v9
> weights = 1000, 500 35.6s 46.7s 35.6s 46.7s
> weigths = 1000, 250 29.2s 45.8s 29.2s 45.6s
>
> The result in term of fairness is close to what we can expect from the
> ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
> the first request get 2/3 (4/5) of io time as long as it runs and thus
> finish in about 3/4 (5/8) of total time.
>

Jerome, after 36.6 seconds, disk will be fully given to second group.
Hence these times might not reflect the accurate measure of who got how
much of disk time.

Can you just capture the output of "io.disk_time" file in both the cgroups
at the time of completion of task in higher weight group. Alternatively,
you can just run this a script in a loop which prints the output of
"cat io.disk_time | grep major:minor" every 2 seconds. That way we can
see how disk times are being distributed between groups.

>
> Results with 2 groups of different io policies, same io weight and
> CFQ scheduler:
> io-scheduler v8 io-scheduler v9
> policy = RT, BE 22.5s 45.3s 22.4s 45.0s
> policy = BE, IDLE 22.6s 44.8s 22.4s 45.0s
>
> Here again, the result in term of fairness is very close from what we
> expect.

Same as above in this case too.

These seem to be good test for fairness measurement in case of streaming
readers. I think one more interesting test case will be do how are the
random read latencies in case of multiple streaming readers present.

So if we can launch 4-5 dd processes in one group and then issue some
random small queueries on postgresql in second group, I am keen to see
how quickly the query can be completed with and without io controller.
Would be interesting to see at results for all 4 io schedulers.

Thanks
Vivek

2009-09-10 20:58:21

by Vivek Goyal

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> > Vivek Goyal wrote:
> > > Hi All,
> > >
> > > Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >
> > Hi Vivek,
> >
> > I've run some postgresql benchmarks for io-controller. Tests have been
> > made with 2.6.31-rc6 kernel, without io-controller patches (when
> > relevant) and with io-controller v8 and v9 patches.
> > I set up two instances of the TPC-H database, each running in their
> > own io-cgroup. I ran two clients to these databases and tested on each
> > that simple request:
> > $ select count(*) from LINEITEM;
> > where LINEITEM is the biggest table of TPC-H (6001215 entries,
> > 720MB). That request generates a steady stream of IOs.
> >
> > Time is measure by psql (\timing switched on). Each test is run twice
> > or more if there is any significant difference between the first two
> > runs. Before each run, the cache is flush:
> > $ echo 3 > /proc/sys/vm/drop_caches
> >
> >
> > Results with 2 groups of same io policy (BE) and same io weight (1000):
> >
> > w/o io-scheduler io-scheduler v8 io-scheduler v9
> > first second first second first second
> > DB DB DB DB DB DB
> >
> > CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
> > Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
> > AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
> > Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
> >
> > As you can see, there is no significant difference for CFQ
> > scheduler.
>
> Thanks Jerome.
>
> > There is big improvement for noop and deadline schedulers
> > (why is that happening?).
>
> I think because now related IO is in a single queue and it gets to run
> for 100ms or so (like CFQ). So previously, IO from both the instances
> will go into a single queue which should lead to more seeks as requests
> from two groups will kind of get interleaved.
>
> With io controller, both groups have separate queues so requests from
> both the data based instances will not get interleaved (This almost
> becomes like CFQ where ther are separate queues for each io context
> and for sequential reader, one io context gets to run nicely for certain
> ms based on its priority).
>
> > The performance with anticipatory scheduler
> > is a bit lower (~4%).
> >

Hi Jerome,

Can you also run the AS test with io controller patches and both the
database in root group (basically don't put them in to separate group). I
suspect that this regression might come from that fact that we now have
to switch between queues and in AS we wait for request to finish from
previous queue before next queue is scheduled in and probably that is
slowing down things a bit.., just a wild guess..

Thanks
Vivek

>
> I will run some tests with AS and see if I can reproduce this lower
> performance and attribute it to a particular piece of code.
>
> >
> > Results with 2 groups of same io policy (BE), different io weights and
> > CFQ scheduler:
> > io-scheduler v8 io-scheduler v9
> > weights = 1000, 500 35.6s 46.7s 35.6s 46.7s
> > weigths = 1000, 250 29.2s 45.8s 29.2s 45.6s
> >
> > The result in term of fairness is close to what we can expect from the
> > ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
> > the first request get 2/3 (4/5) of io time as long as it runs and thus
> > finish in about 3/4 (5/8) of total time.
> >
>
> Jerome, after 36.6 seconds, disk will be fully given to second group.
> Hence these times might not reflect the accurate measure of who got how
> much of disk time.
>
> Can you just capture the output of "io.disk_time" file in both the cgroups
> at the time of completion of task in higher weight group. Alternatively,
> you can just run this a script in a loop which prints the output of
> "cat io.disk_time | grep major:minor" every 2 seconds. That way we can
> see how disk times are being distributed between groups.
>
> >
> > Results with 2 groups of different io policies, same io weight and
> > CFQ scheduler:
> > io-scheduler v8 io-scheduler v9
> > policy = RT, BE 22.5s 45.3s 22.4s 45.0s
> > policy = BE, IDLE 22.6s 44.8s 22.4s 45.0s
> >
> > Here again, the result in term of fairness is very close from what we
> > expect.
>
> Same as above in this case too.
>
> These seem to be good test for fairness measurement in case of streaming
> readers. I think one more interesting test case will be do how are the
> random read latencies in case of multiple streaming readers present.
>
> So if we can launch 4-5 dd processes in one group and then issue some
> random small queueries on postgresql in second group, I am keen to see
> how quickly the query can be completed with and without io controller.
> Would be interesting to see at results for all 4 io schedulers.
>
> Thanks
> Vivek

2009-09-11 13:18:26

by Jerome Marchand

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

Vivek Goyal wrote:
> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>> Vivek Goyal wrote:
>>>> Hi All,
>>>>
>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>
>>> Hi Vivek,
>>>
>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>> relevant) and with io-controller v8 and v9 patches.
>>> I set up two instances of the TPC-H database, each running in their
>>> own io-cgroup. I ran two clients to these databases and tested on each
>>> that simple request:
>>> $ select count(*) from LINEITEM;
>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>> 720MB). That request generates a steady stream of IOs.
>>>
>>> Time is measure by psql (\timing switched on). Each test is run twice
>>> or more if there is any significant difference between the first two
>>> runs. Before each run, the cache is flush:
>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>
>>>
>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>
>>> w/o io-scheduler io-scheduler v8 io-scheduler v9
>>> first second first second first second
>>> DB DB DB DB DB DB
>>>
>>> CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
>>> Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
>>> AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
>>> Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
>>>
>>> As you can see, there is no significant difference for CFQ
>>> scheduler.
>> Thanks Jerome.
>>
>>> There is big improvement for noop and deadline schedulers
>>> (why is that happening?).
>> I think because now related IO is in a single queue and it gets to run
>> for 100ms or so (like CFQ). So previously, IO from both the instances
>> will go into a single queue which should lead to more seeks as requests
>> from two groups will kind of get interleaved.
>>
>> With io controller, both groups have separate queues so requests from
>> both the data based instances will not get interleaved (This almost
>> becomes like CFQ where ther are separate queues for each io context
>> and for sequential reader, one io context gets to run nicely for certain
>> ms based on its priority).
>>
>>> The performance with anticipatory scheduler
>>> is a bit lower (~4%).
>>>
>
> Hi Jerome,
>
> Can you also run the AS test with io controller patches and both the
> database in root group (basically don't put them in to separate group). I
> suspect that this regression might come from that fact that we now have
> to switch between queues and in AS we wait for request to finish from
> previous queue before next queue is scheduled in and probably that is
> slowing down things a bit.., just a wild guess..
>

Hi Vivek,

I guess that's not the reason. I got 46.6s for both DB in root group with
io-controller v9 patches. I also rerun the test with DB in different groups
and found about the same result as above (48.3s and 48.6s).

Jerome

> Thanks
> Vivek
>
>> I will run some tests with AS and see if I can reproduce this lower
>> performance and attribute it to a particular piece of code.
>>
>>> Results with 2 groups of same io policy (BE), different io weights and
>>> CFQ scheduler:
>>> io-scheduler v8 io-scheduler v9
>>> weights = 1000, 500 35.6s 46.7s 35.6s 46.7s
>>> weigths = 1000, 250 29.2s 45.8s 29.2s 45.6s
>>>
>>> The result in term of fairness is close to what we can expect from the
>>> ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
>>> the first request get 2/3 (4/5) of io time as long as it runs and thus
>>> finish in about 3/4 (5/8) of total time.
>>>
>> Jerome, after 36.6 seconds, disk will be fully given to second group.
>> Hence these times might not reflect the accurate measure of who got how
>> much of disk time.
>>
>> Can you just capture the output of "io.disk_time" file in both the cgroups
>> at the time of completion of task in higher weight group. Alternatively,
>> you can just run this a script in a loop which prints the output of
>> "cat io.disk_time | grep major:minor" every 2 seconds. That way we can
>> see how disk times are being distributed between groups.
>>
>>> Results with 2 groups of different io policies, same io weight and
>>> CFQ scheduler:
>>> io-scheduler v8 io-scheduler v9
>>> policy = RT, BE 22.5s 45.3s 22.4s 45.0s
>>> policy = BE, IDLE 22.6s 44.8s 22.4s 45.0s
>>>
>>> Here again, the result in term of fairness is very close from what we
>>> expect.
>> Same as above in this case too.
>>
>> These seem to be good test for fairness measurement in case of streaming
>> readers. I think one more interesting test case will be do how are the
>> random read latencies in case of multiple streaming readers present.
>>
>> So if we can launch 4-5 dd processes in one group and then issue some
>> random small queueries on postgresql in second group, I am keen to see
>> how quickly the query can be completed with and without io controller.
>> Would be interesting to see at results for all 4 io schedulers.
>>
>> Thanks
>> Vivek

2009-09-11 14:32:35

by Vivek Goyal

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> >> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> >>> Vivek Goyal wrote:
> >>>> Hi All,
> >>>>
> >>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >>>
> >>> Hi Vivek,
> >>>
> >>> I've run some postgresql benchmarks for io-controller. Tests have been
> >>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> >>> relevant) and with io-controller v8 and v9 patches.
> >>> I set up two instances of the TPC-H database, each running in their
> >>> own io-cgroup. I ran two clients to these databases and tested on each
> >>> that simple request:
> >>> $ select count(*) from LINEITEM;
> >>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> >>> 720MB). That request generates a steady stream of IOs.
> >>>
> >>> Time is measure by psql (\timing switched on). Each test is run twice
> >>> or more if there is any significant difference between the first two
> >>> runs. Before each run, the cache is flush:
> >>> $ echo 3 > /proc/sys/vm/drop_caches
> >>>
> >>>
> >>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> >>>
> >>> w/o io-scheduler io-scheduler v8 io-scheduler v9
> >>> first second first second first second
> >>> DB DB DB DB DB DB
> >>>
> >>> CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
> >>> Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
> >>> AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
> >>> Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
> >>>
> >>> As you can see, there is no significant difference for CFQ
> >>> scheduler.
> >> Thanks Jerome.
> >>
> >>> There is big improvement for noop and deadline schedulers
> >>> (why is that happening?).
> >> I think because now related IO is in a single queue and it gets to run
> >> for 100ms or so (like CFQ). So previously, IO from both the instances
> >> will go into a single queue which should lead to more seeks as requests
> >> from two groups will kind of get interleaved.
> >>
> >> With io controller, both groups have separate queues so requests from
> >> both the data based instances will not get interleaved (This almost
> >> becomes like CFQ where ther are separate queues for each io context
> >> and for sequential reader, one io context gets to run nicely for certain
> >> ms based on its priority).
> >>
> >>> The performance with anticipatory scheduler
> >>> is a bit lower (~4%).
> >>>
> >
> > Hi Jerome,
> >
> > Can you also run the AS test with io controller patches and both the
> > database in root group (basically don't put them in to separate group). I
> > suspect that this regression might come from that fact that we now have
> > to switch between queues and in AS we wait for request to finish from
> > previous queue before next queue is scheduled in and probably that is
> > slowing down things a bit.., just a wild guess..
> >
>
> Hi Vivek,
>
> I guess that's not the reason. I got 46.6s for both DB in root group with
> io-controller v9 patches. I also rerun the test with DB in different groups
> and found about the same result as above (48.3s and 48.6s).
>

Hi Jerome,

Ok, so when both the DB's are in root group (with io-controller V9
patches), then you get 46.6 seconds time for both the DBs. That means there
is no regression in this case. In this case there is only one queue of
root group and AS is running timed read/write batches on this queue.

But when both the DBs are put in separate groups then you get 48.3 and
48.6 seconds respectively and we see regression. In this case there are
two queues belonging to each group. Elevator layer takes care of queue
group queue switch and AS runs timed read/write batches on these queues.

If it is correct, then it does not exclude the possiblity that it is queue
switching overhead between groups?

Thanks
Vivek

2009-09-11 14:46:17

by Vivek Goyal

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> > Vivek Goyal wrote:
> > > On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> > >> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> > >>> Vivek Goyal wrote:
> > >>>> Hi All,
> > >>>>
> > >>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> > >>>
> > >>> Hi Vivek,
> > >>>
> > >>> I've run some postgresql benchmarks for io-controller. Tests have been
> > >>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> > >>> relevant) and with io-controller v8 and v9 patches.
> > >>> I set up two instances of the TPC-H database, each running in their
> > >>> own io-cgroup. I ran two clients to these databases and tested on each
> > >>> that simple request:
> > >>> $ select count(*) from LINEITEM;
> > >>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> > >>> 720MB). That request generates a steady stream of IOs.
> > >>>
> > >>> Time is measure by psql (\timing switched on). Each test is run twice
> > >>> or more if there is any significant difference between the first two
> > >>> runs. Before each run, the cache is flush:
> > >>> $ echo 3 > /proc/sys/vm/drop_caches
> > >>>
> > >>>
> > >>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> > >>>
> > >>> w/o io-scheduler io-scheduler v8 io-scheduler v9
> > >>> first second first second first second
> > >>> DB DB DB DB DB DB
> > >>>
> > >>> CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
> > >>> Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
> > >>> AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
> > >>> Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
> > >>>
> > >>> As you can see, there is no significant difference for CFQ
> > >>> scheduler.
> > >> Thanks Jerome.
> > >>
> > >>> There is big improvement for noop and deadline schedulers
> > >>> (why is that happening?).
> > >> I think because now related IO is in a single queue and it gets to run
> > >> for 100ms or so (like CFQ). So previously, IO from both the instances
> > >> will go into a single queue which should lead to more seeks as requests
> > >> from two groups will kind of get interleaved.
> > >>
> > >> With io controller, both groups have separate queues so requests from
> > >> both the data based instances will not get interleaved (This almost
> > >> becomes like CFQ where ther are separate queues for each io context
> > >> and for sequential reader, one io context gets to run nicely for certain
> > >> ms based on its priority).
> > >>
> > >>> The performance with anticipatory scheduler
> > >>> is a bit lower (~4%).
> > >>>
> > >
> > > Hi Jerome,
> > >
> > > Can you also run the AS test with io controller patches and both the
> > > database in root group (basically don't put them in to separate group). I
> > > suspect that this regression might come from that fact that we now have
> > > to switch between queues and in AS we wait for request to finish from
> > > previous queue before next queue is scheduled in and probably that is
> > > slowing down things a bit.., just a wild guess..
> > >
> >
> > Hi Vivek,
> >
> > I guess that's not the reason. I got 46.6s for both DB in root group with
> > io-controller v9 patches. I also rerun the test with DB in different groups
> > and found about the same result as above (48.3s and 48.6s).
> >
>
> Hi Jerome,
>
> Ok, so when both the DB's are in root group (with io-controller V9
> patches), then you get 46.6 seconds time for both the DBs. That means there
> is no regression in this case. In this case there is only one queue of
> root group and AS is running timed read/write batches on this queue.
>
> But when both the DBs are put in separate groups then you get 48.3 and
> 48.6 seconds respectively and we see regression. In this case there are
> two queues belonging to each group. Elevator layer takes care of queue
> group queue switch and AS runs timed read/write batches on these queues.
>
> If it is correct, then it does not exclude the possiblity that it is queue
> switching overhead between groups?
>

Does your hard drive support command queuing? May be we are driving deeper
queue depths for reads and during queue switch we will wait for requests
to finish from last queue to finish before next queue is scheduled in (for
AS) and that probably will cause more delay if we are driving deeper queue
depth.

Can you please set queue depth to "1" (/sys/block/<disk>/device/queue_depth) on
this disk and see time consumed in two cases are same or different. I think
setting depth to "1" will bring down overall throughput but if times are same
in two cases, at least we will know where the delay is coming from.

Thanks
Vivek

2009-09-11 14:47:04

by Jerome Marchand

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
>> Vivek Goyal wrote:
>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>>>> Vivek Goyal wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>>>
>>>>> Hi Vivek,
>>>>>
>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>>>> relevant) and with io-controller v8 and v9 patches.
>>>>> I set up two instances of the TPC-H database, each running in their
>>>>> own io-cgroup. I ran two clients to these databases and tested on each
>>>>> that simple request:
>>>>> $ select count(*) from LINEITEM;
>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>>>> 720MB). That request generates a steady stream of IOs.
>>>>>
>>>>> Time is measure by psql (\timing switched on). Each test is run twice
>>>>> or more if there is any significant difference between the first two
>>>>> runs. Before each run, the cache is flush:
>>>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>>>
>>>>>
>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>>>
>>>>> w/o io-scheduler io-scheduler v8 io-scheduler v9
>>>>> first second first second first second
>>>>> DB DB DB DB DB DB
>>>>>
>>>>> CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
>>>>> Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
>>>>> AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
>>>>> Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
>>>>>
>>>>> As you can see, there is no significant difference for CFQ
>>>>> scheduler.
>>>> Thanks Jerome.
>>>>
>>>>> There is big improvement for noop and deadline schedulers
>>>>> (why is that happening?).
>>>> I think because now related IO is in a single queue and it gets to run
>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
>>>> will go into a single queue which should lead to more seeks as requests
>>>> from two groups will kind of get interleaved.
>>>>
>>>> With io controller, both groups have separate queues so requests from
>>>> both the data based instances will not get interleaved (This almost
>>>> becomes like CFQ where ther are separate queues for each io context
>>>> and for sequential reader, one io context gets to run nicely for certain
>>>> ms based on its priority).
>>>>
>>>>> The performance with anticipatory scheduler
>>>>> is a bit lower (~4%).
>>>>>
>>> Hi Jerome,
>>>
>>> Can you also run the AS test with io controller patches and both the
>>> database in root group (basically don't put them in to separate group). I
>>> suspect that this regression might come from that fact that we now have
>>> to switch between queues and in AS we wait for request to finish from
>>> previous queue before next queue is scheduled in and probably that is
>>> slowing down things a bit.., just a wild guess..
>>>
>> Hi Vivek,
>>
>> I guess that's not the reason. I got 46.6s for both DB in root group with
>> io-controller v9 patches. I also rerun the test with DB in different groups
>> and found about the same result as above (48.3s and 48.6s).
>>
>
> Hi Jerome,
>
> Ok, so when both the DB's are in root group (with io-controller V9
> patches), then you get 46.6 seconds time for both the DBs. That means there
> is no regression in this case. In this case there is only one queue of
> root group and AS is running timed read/write batches on this queue.
>
> But when both the DBs are put in separate groups then you get 48.3 and
> 48.6 seconds respectively and we see regression. In this case there are
> two queues belonging to each group. Elevator layer takes care of queue
> group queue switch and AS runs timed read/write batches on these queues.
>
> If it is correct, then it does not exclude the possiblity that it is queue
> switching overhead between groups?

Yes it's correct. I misunderstood you.

Jerome

>
> Thanks
> Vivek

2009-09-11 14:56:39

by Jerome Marchand

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

Vivek Goyal wrote:
> On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
>> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
>>> Vivek Goyal wrote:
>>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
>>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>>>>>> Vivek Goyal wrote:
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>>>>>
>>>>>> Hi Vivek,
>>>>>>
>>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
>>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>>>>>> relevant) and with io-controller v8 and v9 patches.
>>>>>> I set up two instances of the TPC-H database, each running in their
>>>>>> own io-cgroup. I ran two clients to these databases and tested on each
>>>>>> that simple request:
>>>>>> $ select count(*) from LINEITEM;
>>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>>>>>> 720MB). That request generates a steady stream of IOs.
>>>>>>
>>>>>> Time is measure by psql (\timing switched on). Each test is run twice
>>>>>> or more if there is any significant difference between the first two
>>>>>> runs. Before each run, the cache is flush:
>>>>>> $ echo 3 > /proc/sys/vm/drop_caches
>>>>>>
>>>>>>
>>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>>>>>
>>>>>> w/o io-scheduler io-scheduler v8 io-scheduler v9
>>>>>> first second first second first second
>>>>>> DB DB DB DB DB DB
>>>>>>
>>>>>> CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
>>>>>> Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
>>>>>> AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
>>>>>> Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
>>>>>>
>>>>>> As you can see, there is no significant difference for CFQ
>>>>>> scheduler.
>>>>> Thanks Jerome.
>>>>>
>>>>>> There is big improvement for noop and deadline schedulers
>>>>>> (why is that happening?).
>>>>> I think because now related IO is in a single queue and it gets to run
>>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
>>>>> will go into a single queue which should lead to more seeks as requests
>>>>> from two groups will kind of get interleaved.
>>>>>
>>>>> With io controller, both groups have separate queues so requests from
>>>>> both the data based instances will not get interleaved (This almost
>>>>> becomes like CFQ where ther are separate queues for each io context
>>>>> and for sequential reader, one io context gets to run nicely for certain
>>>>> ms based on its priority).
>>>>>
>>>>>> The performance with anticipatory scheduler
>>>>>> is a bit lower (~4%).
>>>>>>
>>>> Hi Jerome,
>>>>
>>>> Can you also run the AS test with io controller patches and both the
>>>> database in root group (basically don't put them in to separate group). I
>>>> suspect that this regression might come from that fact that we now have
>>>> to switch between queues and in AS we wait for request to finish from
>>>> previous queue before next queue is scheduled in and probably that is
>>>> slowing down things a bit.., just a wild guess..
>>>>
>>> Hi Vivek,
>>>
>>> I guess that's not the reason. I got 46.6s for both DB in root group with
>>> io-controller v9 patches. I also rerun the test with DB in different groups
>>> and found about the same result as above (48.3s and 48.6s).
>>>
>> Hi Jerome,
>>
>> Ok, so when both the DB's are in root group (with io-controller V9
>> patches), then you get 46.6 seconds time for both the DBs. That means there
>> is no regression in this case. In this case there is only one queue of
>> root group and AS is running timed read/write batches on this queue.
>>
>> But when both the DBs are put in separate groups then you get 48.3 and
>> 48.6 seconds respectively and we see regression. In this case there are
>> two queues belonging to each group. Elevator layer takes care of queue
>> group queue switch and AS runs timed read/write batches on these queues.
>>
>> If it is correct, then it does not exclude the possiblity that it is queue
>> switching overhead between groups?
>>
>
> Does your hard drive support command queuing? May be we are driving deeper
> queue depths for reads and during queue switch we will wait for requests
> to finish from last queue to finish before next queue is scheduled in (for
> AS) and that probably will cause more delay if we are driving deeper queue
> depth.
>
> Can you please set queue depth to "1" (/sys/block/<disk>/device/queue_depth) on
> this disk and see time consumed in two cases are same or different. I think
> setting depth to "1" will bring down overall throughput but if times are same
> in two cases, at least we will know where the delay is coming from.
>
> Thanks
> Vivek

It looks like command queuing is supported but disabled. Queue depth is already 1
and the file /sys/block/<disk>/device/queue_depth is read-only.

Jerome

2009-09-11 15:02:18

by Vivek Goyal

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

On Fri, Sep 11, 2009 at 04:55:50PM +0200, Jerome Marchand wrote:
> Vivek Goyal wrote:
> > On Fri, Sep 11, 2009 at 10:30:40AM -0400, Vivek Goyal wrote:
> >> On Fri, Sep 11, 2009 at 03:16:23PM +0200, Jerome Marchand wrote:
> >>> Vivek Goyal wrote:
> >>>> On Thu, Sep 10, 2009 at 04:52:27PM -0400, Vivek Goyal wrote:
> >>>>> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
> >>>>>> Vivek Goyal wrote:
> >>>>>>> Hi All,
> >>>>>>>
> >>>>>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
> >>>>>>
> >>>>>> Hi Vivek,
> >>>>>>
> >>>>>> I've run some postgresql benchmarks for io-controller. Tests have been
> >>>>>> made with 2.6.31-rc6 kernel, without io-controller patches (when
> >>>>>> relevant) and with io-controller v8 and v9 patches.
> >>>>>> I set up two instances of the TPC-H database, each running in their
> >>>>>> own io-cgroup. I ran two clients to these databases and tested on each
> >>>>>> that simple request:
> >>>>>> $ select count(*) from LINEITEM;
> >>>>>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
> >>>>>> 720MB). That request generates a steady stream of IOs.
> >>>>>>
> >>>>>> Time is measure by psql (\timing switched on). Each test is run twice
> >>>>>> or more if there is any significant difference between the first two
> >>>>>> runs. Before each run, the cache is flush:
> >>>>>> $ echo 3 > /proc/sys/vm/drop_caches
> >>>>>>
> >>>>>>
> >>>>>> Results with 2 groups of same io policy (BE) and same io weight (1000):
> >>>>>>
> >>>>>> w/o io-scheduler io-scheduler v8 io-scheduler v9
> >>>>>> first second first second first second
> >>>>>> DB DB DB DB DB DB
> >>>>>>
> >>>>>> CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
> >>>>>> Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
> >>>>>> AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
> >>>>>> Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
> >>>>>>
> >>>>>> As you can see, there is no significant difference for CFQ
> >>>>>> scheduler.
> >>>>> Thanks Jerome.
> >>>>>
> >>>>>> There is big improvement for noop and deadline schedulers
> >>>>>> (why is that happening?).
> >>>>> I think because now related IO is in a single queue and it gets to run
> >>>>> for 100ms or so (like CFQ). So previously, IO from both the instances
> >>>>> will go into a single queue which should lead to more seeks as requests
> >>>>> from two groups will kind of get interleaved.
> >>>>>
> >>>>> With io controller, both groups have separate queues so requests from
> >>>>> both the data based instances will not get interleaved (This almost
> >>>>> becomes like CFQ where ther are separate queues for each io context
> >>>>> and for sequential reader, one io context gets to run nicely for certain
> >>>>> ms based on its priority).
> >>>>>
> >>>>>> The performance with anticipatory scheduler
> >>>>>> is a bit lower (~4%).
> >>>>>>
> >>>> Hi Jerome,
> >>>>
> >>>> Can you also run the AS test with io controller patches and both the
> >>>> database in root group (basically don't put them in to separate group). I
> >>>> suspect that this regression might come from that fact that we now have
> >>>> to switch between queues and in AS we wait for request to finish from
> >>>> previous queue before next queue is scheduled in and probably that is
> >>>> slowing down things a bit.., just a wild guess..
> >>>>
> >>> Hi Vivek,
> >>>
> >>> I guess that's not the reason. I got 46.6s for both DB in root group with
> >>> io-controller v9 patches. I also rerun the test with DB in different groups
> >>> and found about the same result as above (48.3s and 48.6s).
> >>>
> >> Hi Jerome,
> >>
> >> Ok, so when both the DB's are in root group (with io-controller V9
> >> patches), then you get 46.6 seconds time for both the DBs. That means there
> >> is no regression in this case. In this case there is only one queue of
> >> root group and AS is running timed read/write batches on this queue.
> >>
> >> But when both the DBs are put in separate groups then you get 48.3 and
> >> 48.6 seconds respectively and we see regression. In this case there are
> >> two queues belonging to each group. Elevator layer takes care of queue
> >> group queue switch and AS runs timed read/write batches on these queues.
> >>
> >> If it is correct, then it does not exclude the possiblity that it is queue
> >> switching overhead between groups?
> >>
> >
> > Does your hard drive support command queuing? May be we are driving deeper
> > queue depths for reads and during queue switch we will wait for requests
> > to finish from last queue to finish before next queue is scheduled in (for
> > AS) and that probably will cause more delay if we are driving deeper queue
> > depth.
> >
> > Can you please set queue depth to "1" (/sys/block/<disk>/device/queue_depth) on
> > this disk and see time consumed in two cases are same or different. I think
> > setting depth to "1" will bring down overall throughput but if times are same
> > in two cases, at least we will know where the delay is coming from.
> >
> > Thanks
> > Vivek
>
> It looks like command queuing is supported but disabled. Queue depth is already 1
> and the file /sys/block/<disk>/device/queue_depth is read-only.

Hmm..., time to run blktraces and in both the cases and compare the two
cases and see what's the issue.

Would be great if you can capture and look at traces. Otherwise I will try
to do it sometime soon..

Thanks
Vivek

2009-09-13 18:56:22

by Vivek Goyal

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

2009-09-14 14:27:04

by Jerome Marchand

[permalink] [raw]

Subject: Re: [RFC] IO scheduler based IO controller V9

Vivek Goyal wrote:
> On Thu, Sep 10, 2009 at 05:18:25PM +0200, Jerome Marchand wrote:
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
>>
>> Hi Vivek,
>>
>> I've run some postgresql benchmarks for io-controller. Tests have been
>> made with 2.6.31-rc6 kernel, without io-controller patches (when
>> relevant) and with io-controller v8 and v9 patches.
>> I set up two instances of the TPC-H database, each running in their
>> own io-cgroup. I ran two clients to these databases and tested on each
>> that simple request:
>> $ select count(*) from LINEITEM;
>> where LINEITEM is the biggest table of TPC-H (6001215 entries,
>> 720MB). That request generates a steady stream of IOs.
>>
>> Time is measure by psql (\timing switched on). Each test is run twice
>> or more if there is any significant difference between the first two
>> runs. Before each run, the cache is flush:
>> $ echo 3 > /proc/sys/vm/drop_caches
>>
>>
>> Results with 2 groups of same io policy (BE) and same io weight (1000):
>>
>> w/o io-scheduler io-scheduler v8 io-scheduler v9
>> first second first second first second
>> DB DB DB DB DB DB
>>
>> CFQ 48.4s 48.4s 48.2s 48.2s 48.1s 48.5s
>> Noop 138.0s 138.0s 48.3s 48.4s 48.5s 48.8s
>> AS 46.3s 47.0s 48.5s 48.7s 48.3s 48.5s
>> Deadl. 137.1s 137.1s 48.2s 48.3s 48.3s 48.5s
>>
>> As you can see, there is no significant difference for CFQ
>> scheduler.
>
> Thanks Jerome.
>
>> There is big improvement for noop and deadline schedulers
>> (why is that happening?).
>
> I think because now related IO is in a single queue and it gets to run
> for 100ms or so (like CFQ). So previously, IO from both the instances
> will go into a single queue which should lead to more seeks as requests
> from two groups will kind of get interleaved.
>
> With io controller, both groups have separate queues so requests from
> both the data based instances will not get interleaved (This almost
> becomes like CFQ where ther are separate queues for each io context
> and for sequential reader, one io context gets to run nicely for certain
> ms based on its priority).
>
>> The performance with anticipatory scheduler
>> is a bit lower (~4%).
>>
>
> I will run some tests with AS and see if I can reproduce this lower
> performance and attribute it to a particular piece of code.
>
>> Results with 2 groups of same io policy (BE), different io weights and
>> CFQ scheduler:
>> io-scheduler v8 io-scheduler v9
>> weights = 1000, 500 35.6s 46.7s 35.6s 46.7s
>> weigths = 1000, 250 29.2s 45.8s 29.2s 45.6s
>>
>> The result in term of fairness is close to what we can expect from the
>> ideal theoric case: with io weights of 1000 and 500 (1000 and 250),
>> the first request get 2/3 (4/5) of io time as long as it runs and thus
>> finish in about 3/4 (5/8) of total time.
>>
>
> Jerome, after 36.6 seconds, disk will be fully given to second group.
> Hence these times might not reflect the accurate measure of who got how
> much of disk time.

I know and took it into account. Let me detail my calculations.

Both request are of the same size and takes alone a time T to complete
(about 22.5s in our example). For sake of simplification, let's ignore
switching cost. Then, the completion time of both requests running at the
same time would be 2T, whatever are their weights or classes.
If one group weights 1000 and the other 500 (resp. 250), the first group
gets 2/3 (4/5) of the bandwidth as long as it is running, and thus finished
in T/(2/3) = 2T*3/4 (resp. T/(4/5) = 2T*5/8 ) that is 3/4 (5/8) of the
total time. The other always finish in about 2T.

The actual results above are pretty closed to these theoretical values and
that how I concluded that the controller is pretty fair.

>
> Can you just capture the output of "io.disk_time" file in both the cgroups
> at the time of completion of task in higher weight group. Alternatively,
> you can just run this a script in a loop which prints the output of
> "cat io.disk_time | grep major:minor" every 2 seconds. That way we can
> see how disk times are being distributed between groups.

Actually, I already check that and the result was good but I didn't keep
the output, so I just rerun the (1000,500) weights. First column is the
time spent by first group since last refresh (refresh period is 2s).
The second column is the same for second group. The group test3 is not
used. The first "ratios" column is the ratio between io time spent by first
group and time spent by second group.

$ ./watch_cgroup.sh 2
test1: 0 test2: 0 test3: 0 ratios: -- -- --
test1: 805 test2: 301 test3: 0 ratios: 2.67441860465116279069 -- --
test1: 1209 test2: 714 test3: 0 ratios: 1.69327731092436974789 -- --
test1: 1306 test2: 503 test3: 0 ratios: 2.59642147117296222664 -- --
test1: 1210 test2: 604 test3: 0 ratios: 2.00331125827814569536 -- --
test1: 1207 test2: 605 test3: 0 ratios: 1.99504132231404958677 -- --
test1: 1209 test2: 605 test3: 0 ratios: 1.99834710743801652892 -- --
test1: 1206 test2: 606 test3: 0 ratios: 1.99009900990099009900 -- --
test1: 1109 test2: 607 test3: 0 ratios: 1.82701812191103789126 -- --
test1: 1213 test2: 603 test3: 0 ratios: 2.01160862354892205638 -- --
test1: 1214 test2: 608 test3: 0 ratios: 1.99671052631578947368 -- --
test1: 1211 test2: 603 test3: 0 ratios: 2.00829187396351575456 -- --
test1: 1110 test2: 603 test3: 0 ratios: 1.84079601990049751243 -- --
test1: 1210 test2: 605 test3: 0 ratios: 2.00000000000000000000 -- --
test1: 1211 test2: 601 test3: 0 ratios: 2.01497504159733777038 -- --
test1: 1210 test2: 607 test3: 0 ratios: 1.99341021416803953871 -- --
test1: 1204 test2: 604 test3: 0 ratios: 1.99337748344370860927 -- --
test1: 1207 test2: 605 test3: 0 ratios: 1.99504132231404958677 -- --
test1: 1089 test2: 708 test3: 0 ratios: 1.53813559322033898305 -- --
test1: 0 test2: 2124 test3: 0 ratios: 0 -- --
test1: 0 test2: 1915 test3: 0 ratios: 0 -- --
test1: 0 test2: 1919 test3: 0 ratios: 0 -- --
test1: 0 test2: 2023 test3: 0 ratios: 0 -- --
test1: 0 test2: 1925 test3: 0 ratios: 0 -- --
test1: 0 test2: 705 test3: 0 ratios: 0 -- --
test1: 0 test2: 0 test3: 0 ratios: -- -- --

As you can see, the ratio stays close to 2 as long as first request is
running.

Regards,
Jerome

>
>> Results with 2 groups of different io policies, same io weight and
>> CFQ scheduler:
>> io-scheduler v8 io-scheduler v9
>> policy = RT, BE 22.5s 45.3s 22.4s 45.0s
>> policy = BE, IDLE 22.6s 44.8s 22.4s 45.0s
>>
>> Here again, the result in term of fairness is very close from what we
>> expect.
>
> Same as above in this case too.
>
> These seem to be good test for fairness measurement in case of streaming
> readers. I think one more interesting test case will be do how are the
> random read latencies in case of multiple streaming readers present.
>
> So if we can launch 4-5 dd processes in one group and then issue some
> random small queueries on postgresql in second group, I am keen to see
> how quickly the query can be completed with and without io controller.
> Would be interesting to see at results for all 4 io schedulers.
>
> Thanks
> Vivek

2009-09-14 14:31:57

by Jerome Marchand

[permalink] [raw]