Hi,
I took the time to remeasure tiobench results on a recent kernel. The short
conclusion is that the performance regression I reported a few months ago is
still there. The machine is a 2-CPU Intel box with 2 GB of RAM and a plain
SATA drive. tiobench sequential write performance numbers (MB/s) with 16 threads:
2.6.29: AVG STDERR
37.80 38.54 39.48 -> 38.606667 0.687475
2.6.32-rc5:
37.36 36.41 36.61 -> 36.793333 0.408928
So, about a 5% regression. The regression appeared sometime between 2.6.29 and
2.6.30 and has stayed the same since then... With the deadline scheduler, there's
no regression. Shouldn't we do something about it?
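(The summary columns can be recomputed, and the deadline comparison set up,
with something like the following - sda is just an example device name; the
spread value is the population standard deviation of the three runs:)

  # recompute average and population standard deviation of the three runs
  echo "37.80 38.54 39.48" | awk '{
      for (i = 1; i <= NF; i++) { sum += $i; sumsq += $i * $i }
      avg = sum / NF
      printf "%f %f\n", avg, sqrt(sumsq / NF - avg * avg)
  }'

  # switch the drive to the deadline scheduler for the comparison run;
  # the active scheduler is shown in brackets
  echo deadline > /sys/block/sda/queue/scheduler
  cat /sys/block/sda/queue/scheduler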
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> Hi,
>
> I took time and remeasured tiobench results on recent kernel. A short
> conclusion is that there is still a performance regression which I reported
> few months ago. The machine is Intel 2 CPU with 2 GB RAM and plain SATA
> drive. tiobench sequential write performance numbers with 16 threads:
> 2.6.29: AVG STDERR
> 37.80 38.54 39.48 -> 38.606667 0.687475
>
> 2.6.32-rc5:
> 37.36 36.41 36.61 -> 36.793333 0.408928
>
> So about 5% regression. The regression happened sometime between 2.6.29 and
> 2.6.30 and stays the same since then... With deadline scheduler, there's
> no regression. Shouldn't we do something about it?
Background:
http://lkml.org/lkml/2009/5/28/415
Thanks for bringing this up again. I'll try to make some time to look
into it if others don't beat me to it.
Cheers,
Jeff
Jan Kara <[email protected]> writes:
> Hi,
>
> I took time and remeasured tiobench results on recent kernel. A short
> conclusion is that there is still a performance regression which I reported
> few months ago. The machine is Intel 2 CPU with 2 GB RAM and plain SATA
> drive. tiobench sequential write performance numbers with 16 threads:
> 2.6.29: AVG STDERR
> 37.80 38.54 39.48 -> 38.606667 0.687475
>
> 2.6.32-rc5:
> 37.36 36.41 36.61 -> 36.793333 0.408928
>
> So about 5% regression. The regression happened sometime between 2.6.29 and
> 2.6.30 and stays the same since then... With deadline scheduler, there's
> no regression. Shouldn't we do something about it?
Sorry it took so long, but I've been flat out lately. I ran some
numbers against 2.6.29 and 2.6.32-rc5, both with low_latency set to 0
and to 1. Here are the results (average of two runs):
rlat | rrlat | wlat | rwlat
kernel | Thr | read | randr | write | randw | avg, max | avg, max | avg, max | avg,max
------------------------------------------------------------------------------------------------------------------------
2.6.29 | 8 | 72.95 | 20.06 | 269.66 | 231.59 | 6.625, 1683.66 | 23.241, 1547.97 | 1.761, 698.10 | 0.720, 443.64
| 16 | 72.33 | 20.03 | 278.85 | 228.81 | 13.643, 2499.77 | 46.575, 1717.10 | 3.304, 1149.29 | 1.011, 140.30
------------------------------------------------------------------------------------------------------------------------
2.6.32-rc5 | 8 | 86.58 | 19.80 | 198.82 | 205.06 | 5.694, 977.26 | 22.559, 870.16 | 2.359, 693.88 | 0.530, 24.32
| 16 | 86.82 | 21.10 | 199.00 | 212.02 | 11.010, 1958.78 | 40.195, 1662.35 | 4.679, 1351.27 | 1.007, 25.36
------------------------------------------------------------------------------------------------------------------------
2.6.32-rc5 | 8 | 87.65 | 117.65 | 298.27 | 212.35 | 5.615, 984.89 | 4.060, 97.39 | 1.535, 311.14 | 0.534, 24.29
low_lat=0 | 16 | 95.60 | 119.95*| 302.48 | 213.27 | 10.263, 1750.19 | 13.899, 1006.21 | 3.221, 734.22 | 1.062, 40.40
------------------------------------------------------------------------------------------------------------------------
Legend:
rlat - read latency (avg, max in ms)
rrlat - random read latency (avg, max in ms)
wlat - write latency (avg, max in ms)
rwlat - random write latency (avg, max in ms)
read/randr/write/randw - throughput in MB/s
* - the two runs reported vastly different numbers: 67.53 and 172.46
So, as you can see, if we turn off the low_latency tunable, we get
better numbers across the board with the exception of random writes.
It's also interesting to note that the latencies reported by tiobench
are more favorable with low_latency set to 0, which is
counter-intuitive.
So, now it seems we don't have a regression in sequential read
bandwidth, but we do have a regression in random read bandwidth (though
the random write latencies look better). So, I'll look into that, as it
is almost 10%, which is significant.
Cheers,
Jeff
Hi Jeff,
what hardware are you using for tests?
I see that aggregate random read bandwidth is larger than sequential read
bandwidth, and write bandwidth is greater than read bandwidth.
Is this a SAN with multiple independent spindles?
On Thu, Nov 5, 2009 at 9:10 PM, Jeff Moyer <[email protected]> wrote:
> Jan Kara <[email protected]> writes:
>
>> Hi,
>>
>> I took time and remeasured tiobench results on recent kernel. A short
>> conclusion is that there is still a performance regression which I reported
>> few months ago. The machine is Intel 2 CPU with 2 GB RAM and plain SATA
>> drive. tiobench sequential write performance numbers with 16 threads:
>> 2.6.29: AVG STDERR
>> 37.80 38.54 39.48 -> 38.606667 0.687475
>>
>> 2.6.32-rc5:
>> 37.36 36.41 36.61 -> 36.793333 0.408928
>>
>> So about 5% regression. The regression happened sometime between 2.6.29 and
>> 2.6.30 and stays the same since then... With deadline scheduler, there's
>> no regression. Shouldn't we do something about it?
>
> Sorry it took so long, but I've been flat out lately. I ran some
> numbers against 2.6.29 and 2.6.32-rc5, both with low_latency set to 0
> and to 1. Here are the results (average of two runs):
>
> rlat | rrlat | wlat | rwlat
> kernel | Thr | read | randr | write | randw | avg, max | avg, max | avg, max | avg,max
> ------------------------------------------------------------------------------------------------------------------------
> 2.6.29 | 8 | 72.95 | 20.06 | 269.66 | 231.59 | 6.625, 1683.66 | 23.241, 1547.97 | 1.761, 698.10 | 0.720, 443.64
> | 16 | 72.33 | 20.03 | 278.85 | 228.81 | 13.643, 2499.77 | 46.575, 1717.10 | 3.304, 1149.29 | 1.011, 140.30
> ------------------------------------------------------------------------------------------------------------------------
> 2.6.32-rc5 | 8 | 86.58 | 19.80 | 198.82 | 205.06 | 5.694, 977.26 | 22.559, 870.16 | 2.359, 693.88 | 0.530, 24.32
> | 16 | 86.82 | 21.10 | 199.00 | 212.02 | 11.010, 1958.78 | 40.195, 1662.35 | 4.679, 1351.27 | 1.007, 25.36
> ------------------------------------------------------------------------------------------------------------------------
> 2.6.32-rc5 | 8 | 87.65 | 117.65 | 298.27 | 212.35 | 5.615, 984.89 | 4.060, 97.39 | 1.535, 311.14 | 0.534, 24.29
> low_lat=0 | 16 | 95.60 | 119.95*| 302.48 | 213.27 | 10.263, 1750.19 | 13.899, 1006.21 | 3.221, 734.22 | 1.062, 40.40
> ------------------------------------------------------------------------------------------------------------------------
>
> Legend:
> rlat - read latency
> rrlat - random read latency
> wlat - write latency
> rwlat - random write latency
> * - the two runs reported vastly different numbers: 67.53 and 172.46
>
> So, as you can see, if we turn off the low_latency tunable, we get
> better numbers across the board with the exception of random writes.
> It's also interesting to note that the latencies reported by tiobench
> are more favorable with low_latency set to 0, which is
> counter-intuitive.
>
> So, now it seems we don't have a regression in sequential read
> bandwidth, but we do have a regression in random read bandwidth (though
> the random write latencies look better). So, I'll look into that, as it
> is almost 10%, which is significant.
>
Sorry, I don't see a 10% regression in random read from your numbers.
I see a larger one in sequential write for low_latency=1 (this was
the regression Jan reported in the original message), but not for
low_latency=0. And there is a 10% regression in random writes, which is not
completely fixed even by disabling low_latency.
I guess your seemingly counter-intuitive results for low_latency are
due to the uncommon hardware (low_latency was intended mainly for
desktop-class disks). Luckily, the patches queued for 2.6.33 already
address this low_latency misbehaviour.
Thanks,
Corrado.
> Cheers,
> Jeff
Corrado Zoccolo <[email protected]> writes:
> Hi Jeff,
> what hardware are you using for tests?
> I see aggregated random read bandwidth is larger than sequential read
> bandwidth, and write bandwidth greater than read.
> Is this a SAN with multiple independent spindles?
Yeah, this is a single path to an HP EVA storage array. There are 24 or
so disks striped in the pool used to create the volume I am using. Jan,
could you repeat your tests with /sys/block/sdX/queue/iosched/low_latency
set to 0?
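Something along these lines, with sdX being whatever device backs the test
filesystem (low_latency defaults to 1 on 2.6.32):

  cat /sys/block/sdX/queue/iosched/low_latency      # current value
  echo 0 > /sys/block/sdX/queue/iosched/low_latency
  # ... run tiobench, then restore the default ...
  echo 1 > /sys/block/sdX/queue/iosched/low_latency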
Cheers,
Jeff
Jeff Moyer <[email protected]> writes:
> Jan Kara <[email protected]> writes:
>
>> Hi,
>>
>> I took time and remeasured tiobench results on recent kernel. A short
>> conclusion is that there is still a performance regression which I reported
>> few months ago. The machine is Intel 2 CPU with 2 GB RAM and plain SATA
>> drive. tiobench sequential write performance numbers with 16 threads:
>> 2.6.29: AVG STDERR
>> 37.80 38.54 39.48 -> 38.606667 0.687475
>>
>> 2.6.32-rc5:
>> 37.36 36.41 36.61 -> 36.793333 0.408928
>>
>> So about 5% regression. The regression happened sometime between 2.6.29 and
>> 2.6.30 and stays the same since then... With deadline scheduler, there's
>> no regression. Shouldn't we do something about it?
>
> Sorry it took so long, but I've been flat out lately. I ran some
> numbers against 2.6.29 and 2.6.32-rc5, both with low_latency set to 0
> and to 1. Here are the results (average of two runs):
I modified the tiobench script to do a drop_caches between runs so I
could stop fiddling around with the numbers myself. Extra credit goes
to anyone who hacks it up to report standard deviation.
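(Roughly, the between-run part boils down to the sketch below; the loop and
the results file are illustrative, not the actual script:)

  for run in 1 2 3; do
      # drop_caches only discards clean caches, so flush dirty data first
      sync
      echo 3 > /proc/sys/vm/drop_caches
      # ... invoke tiobench here and append its throughput numbers to results.txt ...
      # (the standard-deviation bit would just be an awk pass over results.txt)
  done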
Anyway, here are the latest results, average of 3 runs each for 2.6.29
and 2.6.32-rc6 with low_latency set to 0. Note that there was a fix in
CFQ that would result in properly preempting the active queue for
metadata I/O.
rlat | rrlat | wlat | rwlat
kernel | Thr | read | randr | write | randw | avg, max | avg, max | avg, max | avg,max
------------------------------------------------------------------------------------------------------------------------
2.6.29 | 8 | 66.43 | 20.52 | 296.32 | 214.17 | 22.330, 3106.47 | 70.026, 2804.02 | 4.817, 2406.65 | 1.420, 349.44
| 16 | 63.28 | 20.45 | 322.65 | 212.77 | 46.457, 5779.14 |137.455, 4982.75 | 8.378, 5408.60 | 2.764, 425.79
------------------------------------------------------------------------------------------------------------------------
2.6.32-rc6 | 8 | 87.66 | 115.22 | 324.19 | 222.18 | 16.677, 3065.81 | 11.834, 194.18 | 4.261, 1212.86 | 1.577, 103.20
low_lat=0 | 16 | 94.06 | 49.65 | 327.06 | 214.74 | 30.318, 5468.20 | 50.947, 1725.15 | 8.271, 1522.95 | 3.064, 89.16
------------------------------------------------------------------------------------------------------------------------
Given those numbers, everything looks ok from a regression perspective.
More investigation should be done for the random read numbers (given
that they fluctuate quite a bit), but that's purely an enhancement at
this point in time.
Just to be sure, I'll kick off 10 runs and make sure the averages fall
out the same way. If you don't hear from me, though, assume this
regression is fixed. The key is to set low_latency to 0 for this
benchmark. We should probably add a note to the io scheduler documentation
about when to switch off low_latency. Jens, would you mind doing that?
Cheers,
Jeff
On Fri, Nov 6, 2009 at 7:56 PM, Jeff Moyer <[email protected]> wrote:
> Jeff Moyer <[email protected]> writes:
> rlat | rrlat | wlat | rwlat
> kernel | Thr | read | randr | write | randw | avg, max | avg, max | avg, max | avg,max
> ------------------------------------------------------------------------------------------------------------------------
> 2.6.29 | 8 | 66.43 | 20.52 | 296.32 | 214.17 | 22.330, 3106.47 | 70.026, 2804.02 | 4.817, 2406.65 | 1.420, 349.44
> | 16 | 63.28 | 20.45 | 322.65 | 212.77 | 46.457, 5779.14 |137.455, 4982.75 | 8.378, 5408.60 | 2.764, 425.79
> ------------------------------------------------------------------------------------------------------------------------
> 2.6.32-rc6 | 8 | 87.66 | 115.22 | 324.19 | 222.18 | 16.677, 3065.81 | 11.834, 194.18 | 4.261, 1212.86 | 1.577, 103.20
> low_lat=0 | 16 | 94.06 | 49.65 | 327.06 | 214.74 | 30.318, 5468.20 | 50.947, 1725.15 | 8.271, 1522.95 | 3.064, 89.16
> ------------------------------------------------------------------------------------------------------------------------
>
Jeff, Jens,
do you think we should try to do more auto-tuning of cfq parameters?
Looking at those numbers for SANs, I think we are being suboptimal in
some cases.
E.g. sequential read throughput is lower than random read.
In those cases, converting all sync queues to sync-noidle (as defined
in for-2.6.33) should allow better aggregate throughput when there
are multiple sequential readers, as in those tiobench tests.
I also think that the current slice_idle and slice_sync values are good
for devices with an 8ms seek time, but they are too high for non-NCQ
flash devices, where the "seek" penalty is under 1ms yet we still
prefer idling.
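(For anyone who wants to experiment, these are plain CFQ sysfs tunables, in
milliseconds; a sketch with sdX as a placeholder, and the lowered value is
just an example rather than a tested recommendation:)

  cat /sys/block/sdX/queue/iosched/slice_idle     # 8 by default
  cat /sys/block/sdX/queue/iosched/slice_sync     # 100 by default
  # on a non-NCQ flash device with ~1ms "seek" penalty, something much
  # smaller (or 0 to disable idling entirely) is probably more appropriate:
  echo 1 > /sys/block/sdX/queue/iosched/slice_idle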
If we agree on this, should the measurement part (I'm thinking of
measuring things like seek time, throughput, etc.) be added to the
common elevator code, or done inside cfq?
If we want to put it in the common code, maybe we can also remove the
duplication of NCQ detection, by publishing the NCQ flag from elevator
to the io-schedulers.
Thanks,
Corrado
>
> Cheers,
> Jeff
> --
Corrado Zoccolo <[email protected]> writes:
> Jeff, Jens,
> do you think we should try to do more auto-tuning of cfq parameters?
> Looking at those numbers for SANs, I think we are being suboptimal in
> some cases.
> E.g. sequential read throughput is lower than random read.
I investigated this further, and this was due to a problem in the
benchmark. It was being run with only 500 samples for random I/O and
65536 samples for sequential. After fixing this, we see random I/O is
slower than sequential, as expected.
> I also think that current slice_idle and slice_sync values are good
> for devices with 8ms seek time, but they are too high for non-NCQ
> flash devices, where "seek" penalty is under 1ms, and we still prefer
> idling.
Do you have numbers to back that up? If not, throw a fio job file over
the fence and I'll test it on one such device.
> If we agree on this, should the measurement part (I'm thinking to
> measure things like seek time, throughput, etc...) be added to the
> common elevator code, or done inside cfq?
Well, if it's something that is of interest to others, then pushing it
up a layer makes sense. If only CFQ is going to use it, keep it there.
Cheers,
Jeff
On Tue, Nov 10, 2009 at 5:47 PM, Jeff Moyer <[email protected]> wrote:
> Corrado Zoccolo <[email protected]> writes:
>
>> Jeff, Jens,
>> do you think we should try to do more auto-tuning of cfq parameters?
>> Looking at those numbers for SANs, I think we are being suboptimal in
>> some cases.
>> E.g. sequential read throughput is lower than random read.
>
> I investigated this further, and this was due to a problem in the
> benchmark. It was being run with only 500 samples for random I/O and
> 65536 samples for sequential. After fixing this, we see random I/O is
> slower than sequential, as expected.
Ok.
>> I also think that current slice_idle and slice_sync values are good
>> for devices with 8ms seek time, but they are too high for non-NCQ
>> flash devices, where "seek" penalty is under 1ms, and we still prefer
>> idling.
>
> Do you have numbers to back that up? If not, throw a fio job file over
> the fence and I'll test it on one such device.
>
It is based on reasoning.
Currently, idling is based on the assumption that it is worth waiting up to
10ms for a nearby request rather than jumping far away, since the jump
will likely cost more than that. If the jump costs around 1ms, as it does on
flash cards, then waiting 10ms is surely wasted time.
On the other hand, on flash cards a random write could cost 50ms or
more, so we will need to differentiate the last idle before switching
to async writes from the inter-read idles. This should be possible
with the new workload based infrastructure, but we need to measure
those characteristic times in order to use them in the heuristics.
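As for a job file, something along the lines below could be a starting point;
/dev/sdX, the block sizes and the runtime are placeholders to adjust for the
device at hand. Save it as e.g. flash-idle-test.fio and run it with
"fio flash-idle-test.fio":

  ; minimal read-only job mixing sequential and random readers
  [global]
  ioengine=sync
  direct=1
  runtime=60
  time_based
  filename=/dev/sdX

  ; two competing sequential readers, to exercise the idling logic
  [seq-read]
  rw=read
  bs=64k
  numjobs=2

  ; random reads, started once the sequential phase completes
  [rand-read]
  stonewall
  rw=randread
  bs=4k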
>> If we agree on this, should the measurement part (I'm thinking to
>> measure things like seek time, throughput, etc...) be added to the
>> common elevator code, or done inside cfq?
>
> Well, if it's something that is of interest to others, than pushing it
> up a layer makes sense. If only CFQ is going to use it, keep it there.
If the direction is to have only one intelligent I/O scheduler, as the
removal of anticipatory indicates, then it is the latter. I don't
think noop or deadline will ever make any use of them.
But it could still be useful for reporting performance as seen by the
kernel, after the page cache.
Thanks
Corrado
>
> Cheers,
> Jeff
>
On Fri 06-11-09 09:14:53, Jeff Moyer wrote:
> Corrado Zoccolo <[email protected]> writes:
>
> > Hi Jeff,
> > what hardware are you using for tests?
> > I see aggregated random read bandwidth is larger than sequential read
> > bandwidth, and write bandwidth greater than read.
> > Is this a SAN with multiple independent spindles?
>
> Yeah, this is a single path to an HP EVA storage array. There are 24 or
> so disks striped in the pool used to create the volume I am using. Jan,
> could you repeat your tests with /sys/block/sdX/queue/iosched/low_latency
> set to 0?
I'll give it a spin tomorrow...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 05-11-09 15:10:52, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > Hi,
> >
> > I took time and remeasured tiobench results on recent kernel. A short
> > conclusion is that there is still a performance regression which I reported
> > few months ago. The machine is Intel 2 CPU with 2 GB RAM and plain SATA
> > drive. tiobench sequential write performance numbers with 16 threads:
> > 2.6.29: AVG STDERR
> > 37.80 38.54 39.48 -> 38.606667 0.687475
> >
> > 2.6.32-rc5:
> > 37.36 36.41 36.61 -> 36.793333 0.408928
> >
> > So about 5% regression. The regression happened sometime between 2.6.29 and
> > 2.6.30 and stays the same since then... With deadline scheduler, there's
> > no regression. Shouldn't we do something about it?
>
> Sorry it took so long, but I've been flat out lately. I ran some
> numbers against 2.6.29 and 2.6.32-rc5, both with low_latency set to 0
> and to 1. Here are the results (average of two runs):
>
> rlat | rrlat | wlat | rwlat
> kernel | Thr | read | randr | write | randw | avg, max | avg, max | avg, max | avg,max
> ------------------------------------------------------------------------------------------------------------------------
> 2.6.29 | 8 | 72.95 | 20.06 | 269.66 | 231.59 | 6.625, 1683.66 | 23.241, 1547.97 | 1.761, 698.10 | 0.720, 443.64
> | 16 | 72.33 | 20.03 | 278.85 | 228.81 | 13.643, 2499.77 | 46.575, 1717.10 | 3.304, 1149.29 | 1.011, 140.30
> ------------------------------------------------------------------------------------------------------------------------
> 2.6.32-rc5 | 8 | 86.58 | 19.80 | 198.82 | 205.06 | 5.694, 977.26 | 22.559, 870.16 | 2.359, 693.88 | 0.530, 24.32
> | 16 | 86.82 | 21.10 | 199.00 | 212.02 | 11.010, 1958.78 | 40.195, 1662.35 | 4.679, 1351.27 | 1.007, 25.36
> ------------------------------------------------------------------------------------------------------------------------
> 2.6.32-rc5 | 8 | 87.65 | 117.65 | 298.27 | 212.35 | 5.615, 984.89 | 4.060, 97.39 | 1.535, 311.14 | 0.534, 24.29
> low_lat=0 | 16 | 95.60 | 119.95*| 302.48 | 213.27 | 10.263, 1750.19 | 13.899, 1006.21 | 3.221, 734.22 | 1.062, 40.40
> ------------------------------------------------------------------------------------------------------------------------
>
> Legend:
> rlat - read latency
> rrlat - random read latency
> wlat - write latency
> rwlat - random write latency
> * - the two runs reported vastly different numbers: 67.53 and 172.46
>
> So, as you can see, if we turn off the low_latency tunable, we get
> better numbers across the board with the exception of random writes.
> It's also interesting to note that the latencies reported by tiobench
> are more favorable with low_latency set to 0, which is
> counter-intuitive.
>
> So, now it seems we don't have a regression in sequential read
> bandwidth, but we do have a regression in random read bandwidth (though
> the random write latencies look better). So, I'll look into that, as it
> is almost 10%, which is significant.
Sadly, I don't see the improvement you see :(. The numbers are the
same even with low_latency set to 0:
2.6.32-rc5 low_latency = 0:
37.39 36.43 36.51 -> 36.776667 0.434920
But my testing environment is a plain SATA drive so that probably
explains the difference...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> Sadly, I don't see the improvement you can see :(. The numbers are the
> same regardless low_latency set to 0:
> 2.6.32-rc5 low_latency = 0:
> 37.39 36.43 36.51 -> 36.776667 0.434920
> But my testing environment is a plain SATA drive so that probably
> explains the difference...
I just retested (10 runs for each kernel) on a SATA disk with no NCQ
support and I could not see a difference. I'll try to dig up a disk
that supports NCQ. Is that what you're using for testing?
Cheers,
Jeff
                2.6.29     2.6.32-rc6,low_latency=0
----------------------------------------------------
Average:        34.6648    34.4475
Pop.Std.Dev.:   0.55523    0.21981
On Wed 11-11-09 12:43:30, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > Sadly, I don't see the improvement you can see :(. The numbers are the
> > same regardless low_latency set to 0:
> > 2.6.32-rc5 low_latency = 0:
> > 37.39 36.43 36.51 -> 36.776667 0.434920
> > But my testing environment is a plain SATA drive so that probably
> > explains the difference...
>
> I just retested (10 runs for each kernel) on a SATA disk with no NCQ
> support and I could not see a difference. I'll try to dig up a disk
> that support NCQ. Is that what you're using for testing?
I don't think I am. How do I find out?
> 2.6.29 2.6.32-rc6,low_latency=0
> ----------------------------------
> Average: 34.6648 34.4475
> Pop.Std.Dev.: 0.55523 0.21981
Hmm, strange. Miklos Szeredi tried tiobench on his machine and he also
saw the regression. I'll try to think what could make the difference.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> On Wed 11-11-09 12:43:30, Jeff Moyer wrote:
>> Jan Kara <[email protected]> writes:
>>
>> > Sadly, I don't see the improvement you can see :(. The numbers are the
>> > same regardless low_latency set to 0:
>> > 2.6.32-rc5 low_latency = 0:
>> > 37.39 36.43 36.51 -> 36.776667 0.434920
>> > But my testing environment is a plain SATA drive so that probably
>> > explains the difference...
>>
>> I just retested (10 runs for each kernel) on a SATA disk with no NCQ
>> support and I could not see a difference. I'll try to dig up a disk
>> that support NCQ. Is that what you're using for testing?
> I don't think I am. How do I find out?
Good question. ;-) I grep for NCQ in dmesg output and make sure it's
greater than 0/32. There may be a better way, though.
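i.e. something like:

  dmesg | grep -i ncq
  # a line such as "ata1.00: ... NCQ (depth 31/32)" means NCQ is in use,
  # while "NCQ (depth 0/32)" means the drive or link does not use NCQ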
>> 2.6.29 2.6.32-rc6,low_latency=0
>> ----------------------------------
>> Average: 34.6648 34.4475
>> Pop.Std.Dev.: 0.55523 0.21981
> Hmm, strange. Miklos Szeredi tried tiobench on his machine and he also
> saw the regression. I'll try to think what could make the difference.
OK, I'll try again.
Cheers,
Jeff
On Thu, Nov 12 2009, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > On Wed 11-11-09 12:43:30, Jeff Moyer wrote:
> >> Jan Kara <[email protected]> writes:
> >>
> >> > Sadly, I don't see the improvement you can see :(. The numbers are the
> >> > same regardless low_latency set to 0:
> >> > 2.6.32-rc5 low_latency = 0:
> >> > 37.39 36.43 36.51 -> 36.776667 0.434920
> >> > But my testing environment is a plain SATA drive so that probably
> >> > explains the difference...
> >>
> >> I just retested (10 runs for each kernel) on a SATA disk with no NCQ
> >> support and I could not see a difference. I'll try to dig up a disk
> >> that support NCQ. Is that what you're using for testing?
> > I don't think I am. How do I find out?
>
> Good question. ;-) I grep for NCQ in dmesg output and make sure it's
> greater than 0/32. There may be a better way, though.
cat /sys/block/<dev>/device/queue_depth
:-)
--
Jens Axboe
Jens Axboe <[email protected]> writes:
> On Thu, Nov 12 2009, Jeff Moyer wrote:
>> Good question. ;-) I grep for NCQ in dmesg output and make sure it's
>> greater than 0/32. There may be a better way, though.
>
> cat /sys/block/<dev>/device/queue_depth
>
> :-)
OK, your comment about it only working for SCSI disks threw me off.
Perhaps you meant it only works for devices that use the sd driver?
Cheers,
Jeff
On Thu, Nov 12 2009, Jeff Moyer wrote:
> Jens Axboe <[email protected]> writes:
>
> > On Thu, Nov 12 2009, Jeff Moyer wrote:
> >> Good question. ;-) I grep for NCQ in dmesg output and make sure it's
> >> greater than 0/32. There may be a better way, though.
> >
> > cat /sys/block/<dev>/device/queue_depth
> >
> > :-)
>
> OK, your comment about only working for SCSI disks threw me off.
> Perhaps you meant only works for devices that use the sd driver?
Yeah, only works for storage that plugs into the SCSI stack.
--
Jens Axboe
On Thu 12-11-09 15:44:02, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > On Wed 11-11-09 12:43:30, Jeff Moyer wrote:
> >> Jan Kara <[email protected]> writes:
> >>
> >> > Sadly, I don't see the improvement you can see :(. The numbers are the
> >> > same regardless low_latency set to 0:
> >> > 2.6.32-rc5 low_latency = 0:
> >> > 37.39 36.43 36.51 -> 36.776667 0.434920
> >> > But my testing environment is a plain SATA drive so that probably
> >> > explains the difference...
> >>
> >> I just retested (10 runs for each kernel) on a SATA disk with no NCQ
> >> support and I could not see a difference. I'll try to dig up a disk
> >> that support NCQ. Is that what you're using for testing?
> > I don't think I am. How do I find out?
>
> Good question. ;-) I grep for NCQ in dmesg output and make sure it's
> greater than 0/32. There may be a better way, though.
Message in the logs:
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATA-8: Hitachi HTS722016K9SA00, DCDOC54P, max UDMA/133
ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 0/32)
ata1.00: configured for UDMA/133
So apparently no NCQ. /sys/block/sda/device/queue_depth shows 1, but I
guess that's just its way of saying "no NCQ".
What I thought might explain why I'm seeing the drop and you are not is
the size of RAM or the number of CPUs relative to the tiobench file size or
the number of threads. I'm running on a machine with 2 GB of RAM, using a
4 GB file size. The machine has 2 cores and I'm using 16 tiobench threads.
I'm now rerunning the tests with various numbers of threads to see how big
a difference it makes.
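(The rerun is essentially a loop like the one below; the tiobench.pl options
are from memory and may differ between versions, and /mnt/test is just where
the scratch filesystem is mounted:)

  for threads in 1 2 4 8 16 32; do
      sync
      echo 3 > /proc/sys/vm/drop_caches
      # --size is in MB; three runs per thread count
      tiobench.pl --dir /mnt/test --size 4096 --threads $threads --numruns 3
  done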
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 16-11-09 11:47:44, Jan Kara wrote:
> On Thu 12-11-09 15:44:02, Jeff Moyer wrote:
> > Jan Kara <[email protected]> writes:
> >
> > > On Wed 11-11-09 12:43:30, Jeff Moyer wrote:
> > >> Jan Kara <[email protected]> writes:
> > >>
> > >> > Sadly, I don't see the improvement you can see :(. The numbers are the
> > >> > same regardless low_latency set to 0:
> > >> > 2.6.32-rc5 low_latency = 0:
> > >> > 37.39 36.43 36.51 -> 36.776667 0.434920
> > >> > But my testing environment is a plain SATA drive so that probably
> > >> > explains the difference...
> > >>
> > >> I just retested (10 runs for each kernel) on a SATA disk with no NCQ
> > >> support and I could not see a difference. I'll try to dig up a disk
> > >> that support NCQ. Is that what you're using for testing?
> > > I don't think I am. How do I find out?
> >
> > Good question. ;-) I grep for NCQ in dmesg output and make sure it's
> > greater than 0/32. There may be a better way, though.
> Message in the logs:
> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> ata1.00: ATA-8: Hitachi HTS722016K9SA00, DCDOC54P, max UDMA/133
> ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 0/32)
> ata1.00: configured for UDMA/133
> So apparently no NCQ. /sys/block/sda/device/queue_depth shows 1 but I
> guess that's just it's way of saying "no NCQ".
>
> What I thought might make a difference why I'm seeing the drop and you
> are not is size of RAM or number of CPUs vs the tiobench file size or
> number of threads. I'm running on a machine with 2 GB of RAM, using 4 GB
> filesize. The machine has 2 cores and I'm using 16 tiobench threads. I'm
> now rerunning tests with various numbers of threads to see how big
> difference it makes.
OK, here are the numbers (3 runs of each test):
2.6.29:
Threads Avg Stddev
1 42.043333 0.860439
2 40.836667 0.322938
4 41.810000 0.114310
8 40.190000 0.419603
16 39.950000 0.403072
32 39.373333 0.766913
2.6.32-rc7:
Threads Avg Stddev
1 41.580000 0.403072
2 39.163333 0.374641
4 39.483333 0.400111
8 38.560000 0.106145
16 37.966667 0.098770
32 36.476667 0.032998
So apparently the difference between 2.6.29 and 2.6.32-rc7 increases as
the number of threads rises. How many threads have you been running with
when using the SATA drive, and what machine is it?
I'm now running a test with a larger file size (8 GB instead of 4) to see
what difference it makes.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> On Mon 16-11-09 11:47:44, Jan Kara wrote:
>> On Thu 12-11-09 15:44:02, Jeff Moyer wrote:
>> > Jan Kara <[email protected]> writes:
>> >
>> > > On Wed 11-11-09 12:43:30, Jeff Moyer wrote:
>> > >> Jan Kara <[email protected]> writes:
>> > >>
>> > >> > Sadly, I don't see the improvement you can see :(. The numbers are the
>> > >> > same regardless low_latency set to 0:
>> > >> > 2.6.32-rc5 low_latency = 0:
>> > >> > 37.39 36.43 36.51 -> 36.776667 0.434920
>> > >> > But my testing environment is a plain SATA drive so that probably
>> > >> > explains the difference...
>> > >>
>> > >> I just retested (10 runs for each kernel) on a SATA disk with no NCQ
>> > >> support and I could not see a difference. I'll try to dig up a disk
>> > >> that support NCQ. Is that what you're using for testing?
>> > > I don't think I am. How do I find out?
>> >
>> > Good question. ;-) I grep for NCQ in dmesg output and make sure it's
>> > greater than 0/32. There may be a better way, though.
>> Message in the logs:
>> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>> ata1.00: ATA-8: Hitachi HTS722016K9SA00, DCDOC54P, max UDMA/133
>> ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 0/32)
>> ata1.00: configured for UDMA/133
>> So apparently no NCQ. /sys/block/sda/device/queue_depth shows 1 but I
>> guess that's just it's way of saying "no NCQ".
>>
>> What I thought might make a difference why I'm seeing the drop and you
>> are not is size of RAM or number of CPUs vs the tiobench file size or
>> number of threads. I'm running on a machine with 2 GB of RAM, using 4 GB
>> filesize. The machine has 2 cores and I'm using 16 tiobench threads. I'm
>> now rerunning tests with various numbers of threads to see how big
>> difference it makes.
> OK, here are the numbers (3 runs of each test):
> 2.6.29:
> Threads Avg Stddev
> 1 42.043333 0.860439
> 2 40.836667 0.322938
> 4 41.810000 0.114310
> 8 40.190000 0.419603
> 16 39.950000 0.403072
> 32 39.373333 0.766913
>
> 2.6.32-rc7:
> Threads Avg Stddev
> 1 41.580000 0.403072
> 2 39.163333 0.374641
> 4 39.483333 0.400111
> 8 38.560000 0.106145
> 16 37.966667 0.098770
> 32 36.476667 0.032998
>
> So apparently the difference between 2.6.29 and 2.6.32-rc7 increases as
> the number of threads rises. With how many threads have you been running
> when using SATA drive and what machine is it?
> I'm now running a test with larger file size (8GB instead of 4) to see
> what difference it makes.
I've been running with both 8 and 16 threads. The machine has 4 CPUs
and 4GB of RAM. I've been testing with an 8GB file size.
Cheers,
Jeff
On Mon, Nov 16, 2009 at 6:03 PM, Jeff Moyer <[email protected]> wrote:
> Jan Kara <[email protected]> writes:
>
>> On Mon 16-11-09 11:47:44, Jan Kara wrote:
>> OK, here are the numbers (3 runs of each test):
>> 2.6.29:
>> Threads Avg Stddev
>> 1 42.043333 0.860439
>> 2 40.836667 0.322938
>> 4 41.810000 0.114310
>> 8 40.190000 0.419603
>> 16 39.950000 0.403072
>> 32 39.373333 0.766913
>>
>> 2.6.32-rc7:
>> Threads Avg Stddev
>> 1 41.580000 0.403072
>> 2 39.163333 0.374641
>> 4 39.483333 0.400111
>> 8 38.560000 0.106145
>> 16 37.966667 0.098770
>> 32 36.476667 0.032998
>>
>> So apparently the difference between 2.6.29 and 2.6.32-rc7 increases as
>> the number of threads rises. With how many threads have you been running
>> when using SATA drive and what machine is it?
>> I'm now running a test with larger file size (8GB instead of 4) to see
>> what difference it makes.
>
> I've been running with both 8 and 16 threads. The machine has 4 CPUs
> and 4GB of RAM. I've been testing with an 8GB file size.
Other details may be relevant, e.g. the file system on which the file
is located, whether the caches are dropped before starting each run,
and so on.
Corrado
>
> Cheers,
> Jeff
On Mon 16-11-09 12:03:00, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > On Mon 16-11-09 11:47:44, Jan Kara wrote:
> >> On Thu 12-11-09 15:44:02, Jeff Moyer wrote:
> >> > Jan Kara <[email protected]> writes:
> >> >
> >> > > On Wed 11-11-09 12:43:30, Jeff Moyer wrote:
> >> > >> Jan Kara <[email protected]> writes:
> >> > >>
> >> > >> > Sadly, I don't see the improvement you can see :(. The numbers are the
> >> > >> > same regardless low_latency set to 0:
> >> > >> > 2.6.32-rc5 low_latency = 0:
> >> > >> > 37.39 36.43 36.51 -> 36.776667 0.434920
> >> > >> > But my testing environment is a plain SATA drive so that probably
> >> > >> > explains the difference...
> >> > >>
> >> > >> I just retested (10 runs for each kernel) on a SATA disk with no NCQ
> >> > >> support and I could not see a difference. I'll try to dig up a disk
> >> > >> that support NCQ. Is that what you're using for testing?
> >> > > I don't think I am. How do I find out?
> >> >
> >> > Good question. ;-) I grep for NCQ in dmesg output and make sure it's
> >> > greater than 0/32. There may be a better way, though.
> >> Message in the logs:
> >> ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> >> ata1.00: ATA-8: Hitachi HTS722016K9SA00, DCDOC54P, max UDMA/133
> >> ata1.00: 312581808 sectors, multi 16: LBA48 NCQ (depth 0/32)
> >> ata1.00: configured for UDMA/133
> >> So apparently no NCQ. /sys/block/sda/device/queue_depth shows 1 but I
> >> guess that's just it's way of saying "no NCQ".
> >>
> >> What I thought might make a difference why I'm seeing the drop and you
> >> are not is size of RAM or number of CPUs vs the tiobench file size or
> >> number of threads. I'm running on a machine with 2 GB of RAM, using 4 GB
> >> filesize. The machine has 2 cores and I'm using 16 tiobench threads. I'm
> >> now rerunning tests with various numbers of threads to see how big
> >> difference it makes.
> > OK, here are the numbers (3 runs of each test):
> > 2.6.29:
> > Threads Avg Stddev
> > 1 42.043333 0.860439
> > 2 40.836667 0.322938
> > 4 41.810000 0.114310
> > 8 40.190000 0.419603
> > 16 39.950000 0.403072
> > 32 39.373333 0.766913
> >
> > 2.6.32-rc7:
> > Threads Avg Stddev
> > 1 41.580000 0.403072
> > 2 39.163333 0.374641
> > 4 39.483333 0.400111
> > 8 38.560000 0.106145
> > 16 37.966667 0.098770
> > 32 36.476667 0.032998
> >
> > So apparently the difference between 2.6.29 and 2.6.32-rc7 increases as
> > the number of threads rises. With how many threads have you been running
> > when using SATA drive and what machine is it?
> > I'm now running a test with larger file size (8GB instead of 4) to see
> > what difference it makes.
>
> I've been running with both 8 and 16 threads. The machine has 4 CPUs
> and 4GB of RAM. I've been testing with an 8GB file size.
OK, I see a similar regression with the 8GB file size as well:
2.6.29:
Threads Avg Stddev
1 41.556667 0.787415
2 40.866667 0.714112
4 40.726667 0.228376
8 38.596667 0.344706
16 39.076667 0.180801
32 37.743333 0.147271
2.6.32-rc7:
Threads Avg Stddev
1 41.860000 0.063770
2 39.196667 0.012472
4 39.426667 0.162138
8 37.550000 0.040825
16 37.710000 0.096264
32 35.680000 0.109848
BTW: I always run the test on a freshly created ext3 filesystem in
data=ordered mode with barrier=1.
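(That is, recreated before each series of runs with something like the
following; the device and mount point are placeholders:)

  mkfs.ext3 /dev/sdX1
  mount -o data=ordered,barrier=1 /dev/sdX1 /mnt/test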
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR