2006-12-08 00:04:55

by Avantika Mathur LTC

Subject: cfq performance gap

Hi Jens,

I've noticed a performance gap between the cfq scheduler and other io
schedulers when running the rawio benchmark.
Results from rawio on 2.6.19, cfq and noop schedulers:

CFQ:

procs  device           num read    KB/sec   I/O Ops/sec
-----  ---------------  ----------  -------  --------------
   16  /dev/sda              16412     8338            2084
-----  ---------------  ----------  -------  --------------
   16                        16412     8338            2084

Total run time 0.492072 seconds


NOOP:

procs  device           num read    KB/sec   I/O Ops/sec
-----  ---------------  ----------  -------  --------------
   16  /dev/sda              16399    29224            7306
-----  ---------------  ----------  -------  --------------
   16                        16399    29224            7306

Total run time 0.140284 seconds

The benchmark workload is 16 processes running 4k random reads.
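
(For illustration only, and not the actual rawio source: a workload of this
shape, 16 processes each issuing 4k reads against the device, could be
sketched roughly as below. The device path, test-area size and iteration
count are arbitrary assumptions.)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS   16
#define BLKSIZE  4096
#define NREADS   256
#define TESTAREA (64UL * 1024 * 1024)   /* assumed test area, not from rawio */

int main(void)
{
    for (int i = 0; i < NPROCS; i++) {
        if (fork() == 0) {
            char buf[BLKSIZE];
            int fd = open("/dev/sda", O_RDONLY);   /* device is an assumption */

            if (fd < 0) {
                perror("open");
                _exit(1);
            }
            srand(getpid());
            for (int j = 0; j < NREADS; j++) {
                off_t off = (off_t)(rand() % (TESTAREA / BLKSIZE)) * BLKSIZE;

                if (pread(fd, buf, BLKSIZE, off) < 0)
                    perror("pread");
            }
            close(fd);
            _exit(0);
        }
    }
    for (int i = 0; i < NPROCS; i++)
        wait(NULL);
    return 0;
}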

Is this performance gap a known issue?
Thanks,
Avantika Mathur



2006-12-08 12:04:43

by Jens Axboe

Subject: Re: cfq performance gap

On Thu, Dec 07 2006, Avantika Mathur wrote:
> Hi Jens,

(you probably noticed now, but the [email protected] email is no longer
valid)

> I've noticed a performance gap between the cfq scheduler and other io
> schedulers when running the rawio benchmark.
> Results from rawio on 2.6.19, cfq and noop schedulers:
>
> CFQ:
>
> procs device num read KB/sec I/O Ops/sec
> ----- --------------- ---------- ------- --------------
> 16 /dev/sda 16412 8338 2084
> ----- --------------- ---------- ------- --------------
> 16 16412 8338 2084
>
> Total run time 0.492072 seconds
>
>
> NOOP:
>
> procs device num read KB/sec I/O Ops/sec
> ----- --------------- ---------- ------- --------------
> 16 /dev/sda 16399 29224 7306
> ----- --------------- ---------- ------- --------------
> 16 16399 29224 7306
>
> Total run time 0.140284 seconds
>
> The benchmark workload is 16 processes running 4k random reads.
>
> Is this performance gap a known issue?

CFQ could be a little slower at this benchmark, but your results are
much worse than I would expect. What is the queueing depth of sda? How
are you invoking rawio?

Your runtime is very low, how does it look if you allow the test to run
for much longer? 30MiB/sec random read bandwidth seems very high, I'm
wondering what exactly is being tested here.
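
(For reference, both values can be read straight out of sysfs; a small
sketch, assuming a SCSI disk that shows up as sda:)

#include <stdio.h>

static void show(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return;
    }
    if (fgets(line, sizeof(line), f))
        printf("%-40s %s", path, line);
    fclose(f);
}

int main(void)
{
    show("/sys/block/sda/queue/scheduler");      /* active I/O scheduler */
    show("/sys/block/sda/device/queue_depth");   /* device (TCQ) queue depth */
    show("/sys/block/sda/queue/nr_requests");    /* block-layer request queue size */
    return 0;
}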

--
Jens Axboe

2006-12-08 22:11:35

by Avantika Mathur LTC

Subject: Re: cfq performance gap

On Fri, 2006-12-08 at 13:05 +0100, Jens Axboe wrote:
> On Thu, Dec 07 2006, Avantika Mathur wrote:
> > Hi Jens,
>
> (you probably noticed now, but the [email protected] email is no longer
> valid)

I saw that, thanks!
> > I've noticed a performance gap between the cfq scheduler and other io
> > schedulers when running the rawio benchmark.
> > Results from rawio on 2.6.19, cfq and noop schedulers:
> >
> > CFQ:
> >
> > procs device num read KB/sec I/O Ops/sec
> > ----- --------------- ---------- ------- --------------
> > 16 /dev/sda 16412 8338 2084
> > ----- --------------- ---------- ------- --------------
> > 16 16412 8338 2084
> >
> > Total run time 0.492072 seconds
> >
> >
> > NOOP:
> >
> > procs device num read KB/sec I/O Ops/sec
> > ----- --------------- ---------- ------- --------------
> > 16 /dev/sda 16399 29224 7306
> > ----- --------------- ---------- ------- --------------
> > 16 16399 29224 7306
> >
> > Total run time 0.140284 seconds
> >
> > The benchmark workload is 16 processes running 4k random reads.
> >
> > Is this performance gap a known issue?
>
> CFQ could be a little slower at this benchmark, but your results are
> much worse than I would expect. What is the queueing depth of sda? How
> are you invoking rawio?

I am running rawio with the following options:
rawread -p 16 -m 1 -d 1 -x -z -t 0 -s 4096

The queue depth on sda is 4.

>
> Your runtime is very low, how does it look if you allow the test to run
> for much longer? 30MiB/sec random read bandwidth seems very high, I'm
> wondering what exactly is being tested here.
>

rawio is actually performing sequential reads, but I don't believe it is
purely sequential with the multiple processes.
I am currently running the test with longer runtimes and will post
results once it is complete.
I've also attached the rawio source.

Thanks,
Avantika


Attachments:
rawio-2.4.2.tar.gz (12.79 kB)

2006-12-11 14:07:34

by Jens Axboe

Subject: Re: cfq performance gap

On Fri, Dec 08 2006, Avantika Mathur wrote:
> On Fri, 2006-12-08 at 13:05 +0100, Jens Axboe wrote:
> > On Thu, Dec 07 2006, Avantika Mathur wrote:
> > > Hi Jens,
> >
> > (you probably noticed now, but the [email protected] email is no longer
> > valid)
>
> I saw that, thanks!
> > > I've noticed a performance gap between the cfq scheduler and other io
> > > schedulers when running the rawio benchmark.
> > > Results from rawio on 2.6.19, cfq and noop schedulers:
> > >
> > > CFQ:
> > >
> > > procs device num read KB/sec I/O Ops/sec
> > > ----- --------------- ---------- ------- --------------
> > > 16 /dev/sda 16412 8338 2084
> > > ----- --------------- ---------- ------- --------------
> > > 16 16412 8338 2084
> > >
> > > Total run time 0.492072 seconds
> > >
> > >
> > > NOOP:
> > >
> > > procs device num read KB/sec I/O Ops/sec
> > > ----- --------------- ---------- ------- --------------
> > > 16 /dev/sda 16399 29224 7306
> > > ----- --------------- ---------- ------- --------------
> > > 16 16399 29224 7306
> > >
> > > Total run time 0.140284 seconds
> > >
> > > The benchmark workload is 16 processes running 4k random reads.
> > >
> > > Is this performance gap a known issue?
> >
> > CFQ could be a little slower at this benchmark, but your results are
> > much worse than I would expect. What is the queueing depth of sda? How
> > are you invoking rawio?
>
> I am running rawio with the following options:
> rawread -p 16 -m 1 -d 1 -x -z -t 0 -s 4096
>
> The queue depth on sda is 4.
>
> >
> > Your runtime is very low, how does it look if you allow the test to run
> > for much longer? 30MiB/sec random read bandwidth seems very high, I'm
> > wondering what exactly is being tested here.
> >
>
> rawio is actually performing sequential reads, but I don't believe it is
> purely sequential with the multiple processes.
> I am currently running the test with longer runtimes and will post
> results once it is complete.
> I've also attached the rawio source.

It's certainly the slice and idling hurting here. But at the same time,
I don't really think your test case is very interesting. The test area
is very small and you have 16 threads trying to read the same thing,
optimizing for that would be silly as I don't think it has much real
world relevance.

That said, I might add some logic to detect when we can cheaply switch
queues instead of waiting for a new request from the same queue.
Averaging slice times over a period of time instead of 1:1, combined with
that logic, should help cases like this while still being fair.
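
(Very roughly, that idea could look something like the sketch below. The
types, constants and helper are invented purely for illustration and are not
actual CFQ code.)

typedef unsigned long long sector_t;

#define CHEAP_SEEK_SECTORS 128ULL   /* "close enough that switching costs little" */
#define SLICE_SLACK        4UL      /* tolerated imbalance in averaged slice time */

struct io_queue {
    sector_t last_sector;           /* where this queue's last request ended */
    unsigned long avg_slice;        /* slice time averaged over a recent window */
};

/*
 * Instead of always idling for another request from the current queue,
 * switch right away when some other queue has a request that is cheap to
 * reach, and judge fairness on slice times averaged over a window rather
 * than strictly 1:1.
 */
static int should_switch_queue(const struct io_queue *cur,
                               const struct io_queue *next,
                               sector_t next_sector)
{
    sector_t dist = next_sector > cur->last_sector ?
                    next_sector - cur->last_sector :
                    cur->last_sector - next_sector;

    return dist <= CHEAP_SEEK_SECTORS &&
           next->avg_slice <= cur->avg_slice + SLICE_SLACK;
}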

--
Jens Axboe

2006-12-13 01:32:51

by Avantika Mathur

Subject: Re: cfq performance gap

Jens Axboe wrote:
> On Fri, Dec 08 2006, Avantika Mathur wrote:
>
>> On Fri, 2006-12-08 at 13:05 +0100, Jens Axboe wrote:
>>
>>> On Thu, Dec 07 2006, Avantika Mathur wrote:
>>>
>>>> Hi Jens,
>>>>
>>> (you probably noticed now, but the [email protected] email is no longer
>>> valid)
>>>
>> I saw that, thanks!
>>
>>>> I've noticed a performance gap between the cfq scheduler and other io
>>>> schedulers when running the rawio benchmark.
>>>>
>>>> The benchmark workload is 16 processes running 4k random reads.
>>>>
>>>> Is this performance gap a known issue?
>>>>
>>> CFQ could be a little slower at this benchmark, but your results are
>>> much worse than I would expect. What is the queueing depth of sda? How
>>> are you invoking rawio?
>>>
>> I am running rawio with the following options:
>> rawread -p 16 -m 1 -d 1 -x -z -t 0 -s 4096
>>
>> The queue depth on sda is 4.
>>
>>> Your runtime is very low, how does it look if you allow the test to run
>>> for much longer? 30MiB/sec random read bandwidth seems very high, I'm
>>> wondering what exactly is being tested here.
>>>
>> rawio is actually performing sequential reads, but I don't believe it is
>> purely sequential with the multiple processes.
>> I am currently running the test with longer runtimes and will post
>> results once it is complete.
>> I've also attached the rawio source.
>>
>
> It's certainly the slice and idling hurting here. But at the same time,
> I don't really think your test case is very interesting. The test area
> is very small and you have 16 threads trying to read the same thing,
> optimizing for that would be silly as I don't think it has much real
> world relevance.
>
>
Could a database have a similar workload to this test?
> That said, I might add some logic to detect when we can cheaply switch
> queues instead of waiting for a new request from the same queue.
> Averaging slice times over a period of time instead of 1:1, combined with
> that logic, should help cases like this while still being fair.
>
Thank you for looking at this issue.
I've found an IBM/SUSE bugzilla bug for the same performance gap on
rawio. There was a fix for this bug included in SLES10-RC1; do you know
why it was not added to mainline?

Thanks again,
Avantika Mathur

2006-12-13 05:33:11

by Chen, Kenneth W

Subject: RE: cfq performance gap

AVANTIKA R. MATHUR wrote on Tuesday, December 12, 2006 5:33 PM
> >> rawio is actually performing sequential reads, but I don't believe it is
> >> purely sequential with the multiple processes.
> >> I am currently running the test with longer runtimes and will post
> >> results once it is complete.
> >> I've also attached the rawio source.
> >>
> >
> > It's certainly the slice and idling hurting here. But at the same time,
> > I don't really think your test case is very interesting. The test area
> > is very small and you have 16 threads trying to read the same thing,
> > optimizing for that would be silly as I don't think it has much real
> > world relevance.
>
> Could a database have a similar workload to this test?


No.

Nothing I have seen with db workloads exhibits such a pattern. There are
basically two types of db workloads: one does transaction processing, where
the I/O pattern is truly random with a large stride, both within each process
context and in the overall I/O seen at the device level. The second is
decision-support db queries, which do large sequential I/O within one
process context.

This rawio test plows through sequential I/O, distributing each small record
across the threads modulo the thread count. So each thread's accesses appear
non-contiguous within its own process context, while the overall requests
hitting the device are sequential. I can't see how any application does that
kind of I/O pattern.
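
(To illustrate the pattern being described, a made-up sketch rather than the
rawio code: thread t reads records t, t+16, t+32, ..., so each thread's
offsets are strided while all threads together cover the area sequentially.)

#include <stdio.h>

#define NTHREADS 16
#define RECSIZE  4096
#define NRECORDS 64        /* small count just to keep the output short */

int main(void)
{
    for (int t = 0; t < NTHREADS; t++) {
        printf("thread %2d reads offsets:", t);
        for (int r = t; r < NRECORDS; r += NTHREADS)
            printf(" %ld", (long)r * RECSIZE);
        printf("\n");
    }
    return 0;
}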

- Ken

2006-12-13 07:19:36

by Jens Axboe

Subject: Re: cfq performance gap

On Tue, Dec 12 2006, AVANTIKA R. MATHUR wrote:
> >That said, I might add some logic to detect when we can cheaply switch
> >queues instead of waiting for a new request from the same queue.
> >Averaging slice times over a period of time instead of 1:1, combined with
> >that logic, should help cases like this while still being fair.
> >
> Thank you for looking at this issue.
> I've found an IBM/SUSE bugzilla bug for the same performance gap on
> rawio. There was a fix for this bug included in SLES10-RC1; do you know
> why it was not added to mainline?

Which bug do you mean? It was likely me doing the fixing on that bug,
and I'm certain that the patch is in mainline. If you had included the bug
number, I could have expanded on that.

--
Jens Axboe

2006-12-13 10:06:36

by Miquel van Smoorenburg

Subject: Re: cfq performance gap

In article <[email protected]>,
Chen, Kenneth W <[email protected]> wrote:
>This rawio test plows through sequential I/O, distributing each small record
>across the threads modulo the thread count. So each thread's accesses appear
>non-contiguous within its own process context, while the overall requests
>hitting the device are sequential. I can't see how any application does that
>kind of I/O pattern.

An NNTP server that has many incoming connections, handled by
multiple threads, and that stores the data in cyclic buffers?

Mike.

2006-12-13 16:20:57

by Chen, Kenneth W

Subject: RE: cfq performance gap

Miquel van Smoorenburg wrote on Wednesday, December 13, 2006 1:57 AM
> Chen, Kenneth W <[email protected]> wrote:
> >This rawio test plows through sequential I/O, distributing each small record
> >across the threads modulo the thread count. So each thread's accesses appear
> >non-contiguous within its own process context, while the overall requests
> >hitting the device are sequential. I can't see how any application does that
> >kind of I/O pattern.
>
> An NNTP server that has many incoming connections, handled by
> multiple threads, and that stores the data in cyclic buffers?

Then whichever thread dumps the buffer content to the storage
will do one large contiguous I/O.

2006-12-13 16:41:39

by Miquel van Smoorenburg

Subject: Re: cfq performance gap

In article <[email protected]>,
Chen, Kenneth W <[email protected]> wrote:
>Miquel van Smoorenburg wrote on Wednesday, December 13, 2006 1:57 AM
>> Chen, Kenneth W <[email protected]> wrote:
>> >This rawio test plows through sequential I/O, distributing each small record
>> >across the threads modulo the thread count. So each thread's accesses appear
>> >non-contiguous within its own process context, while the overall requests
>> >hitting the device are sequential. I can't see how any application does that
>> >kind of I/O pattern.
>>
>> An NNTP server that has many incoming connections, handled by
>> multiple threads, and that stores the data in cyclic buffers?
>
>Then whichever thread dumps the buffer content to the storage
>will do one large contiguous I/O.

In this context, "cyclic buffer" means "large fixed-size file" or
"disk partition", and when the end of that file/partition is reached,
writing resumes at the start (wraps around, starts the next cycle).

Each thread writes an article to disk, which can vary in size
from 1K to 1M. Taken all together, the writes are sequential, but the writes
from one thread are definitely not.
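
(A made-up sketch of that kind of cyclic-buffer writer, not code from INN or
Diablo, with the file name and sizes picked arbitrarily: each thread claims
the next slot from a shared offset and wraps at the end of the buffer, so the
writes are sequential overall but not per thread.)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFFILE  "/var/spool/cycbuf"    /* large fixed-size file; path is made up */
#define BUFSIZE  (256UL * 1024 * 1024)  /* cyclic buffer size, arbitrary */
#define ARTSIZE  8192                   /* pretend every article is 8K */
#define NTHREADS 8
#define ARTICLES 100

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_off;          /* shared write position */

static void *writer(void *arg)
{
    char art[ARTSIZE];
    int fd = open(BUFFILE, O_WRONLY);

    (void)arg;
    if (fd < 0)
        return NULL;
    memset(art, 'x', sizeof(art));
    for (int i = 0; i < ARTICLES; i++) {
        unsigned long off;

        /* claim the next slot; wrap around at the end of the buffer */
        pthread_mutex_lock(&lock);
        off = next_off;
        next_off = (next_off + ARTSIZE) % BUFSIZE;
        pthread_mutex_unlock(&lock);

        /*
         * All threads together walk the file sequentially, but each
         * individual thread's offsets end up scattered.
         */
        if (pwrite(fd, art, sizeof(art), off) < 0)
            perror("pwrite");
    }
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, writer, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}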

This is a real-world example: I have written software that does
exactly this, multithreaded versions of INN exist that, with CNFS
storage, do exactly this, and Diablo does something comparable
(only it uses processes instead of threads).

Mike.