2004-09-01 13:24:44

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

Bartlomiej Zolnierkiewicz wrote:
>
>>What determines whether 48 bit addressing will be used then?
>
> Availability of 48-bit addressing feature set and host capabilities
> (some don't support LBA48 when DMA is used etc.).

I haven't examined the "released" IDE drivers in some time,
but one optimisation that can save a LOT of CPU usage
is for the driver to only use LBA48 *when necessary*,
and use LBA28 I/O otherwise.

Each access to an IDE register typically chews up 600+ns,
or the equivalent of a couple thousand instruction executions
on a modern core. Avoiding LBA48 when it's not needed will
save four such accesses per I/O, or about 2.5us.
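
The arithmetic behind that estimate, as a minimal sketch. The function
names are hypothetical, and the CPU clock is an assumption supplied by
the caller; only the 600 ns and four-access figures come from the text
above.

```c
#include <assert.h>
#include <stdint.h>

/* Extra register accesses an LBA48 command needs, per the text above. */
#define LBA48_EXTRA_ACCESSES 4u

/* Extra setup time in nanoseconds for one LBA48 command, given the
 * cost of a single register access (600 ns in the text). */
uint64_t lba48_extra_ns(uint64_t ns_per_access)
{
    return LBA48_EXTRA_ACCESSES * ns_per_access;
}

/* Convert a duration in nanoseconds to CPU cycles at cpu_mhz.
 * The clock rate is an assumption; pick whatever fits the machine. */
uint64_t ns_to_cycles(uint64_t ns, uint64_t cpu_mhz)
{
    /* Guard against overflow in the multiplication below. */
    assert(cpu_mhz == 0 || ns <= UINT64_MAX / cpu_mhz);
    return ns * cpu_mhz / 1000u;
}
```

With 600 ns per access, lba48_extra_ns(600) gives 2400 ns, which is the
"about 2.5us" above. At an assumed 3000 MHz clock, a single 600 ns
access corresponds to 1800 cycles.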

LBA48 is only needed when (1) the sector count is greater than 256,
and/or (2) the ending sector number >= (1<<28).
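
A minimal sketch of that test, assuming a request described by its
first sector and sector count. The names are hypothetical and not taken
from any driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LBA28_MAX_SECTORS 256u             /* largest LBA28 transfer */
#define LBA28_LIMIT ((uint64_t)1 << 28)    /* first sector beyond LBA28 */

/* Returns true when the request has to be issued as LBA48.
 * start: first sector of the request; nsect: sector count (> 0). */
bool needs_lba48(uint64_t start, uint32_t nsect)
{
    assert(nsect > 0);
    uint64_t last = start + nsect - 1;     /* ending sector number */
    return nsect > LBA28_MAX_SECTORS || last >= LBA28_LIMIT;
}
```

A 256-sector request ending at sector (1<<28)-1 still fits LBA28; move
it up by one sector, or grow it to 257 sectors, and it needs LBA48.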

I regularly include this optimisation in the drivers I have been
working on since LBA48 first appeared.

Cheers
--
Mark Lord
(hdparm keeper & the original "Linux IDE Guy")


2004-09-01 14:44:02

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

Mark Lord wrote:
> Bartlomiej Zolnierkiewicz wrote:
>
>>
>>> What determines whether 48 bit addressing will be used then?
>>
>>
>> Availability of 48-bit addressing feature set and host capabilities
>> (some don't support LBA48 when DMA is used etc.).
>
>
> I haven't examined the "released" IDE drivers in some time,
> but one optimisation that can save a LOT of CPU usage
> is for the driver to only use LBA48 *when necessary*,
> and use LBA28 I/O otherwise.
>
> Each access to an IDE register typically chews up 600+ns,
> or the equivalent of a couple thousand instruction executions
> on a modern core. Avoiding LBA48 when it's not needed will
> save four such accesses per I/O, or about 2.5us.


Doing this is either pointless or impossible on newer SATA controllers.
Most are memory-mapped I/O not PIO, where the high-order bits of the
ATA taskfile are accessed due to an extended register size, not
"double-pumping" a FIFO.

Even-newer SATA controllers are FIS-based rather than taskfile-based, so
you pass it a FIS (containing all the registers) unconditionally.
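
To illustrate why there is nothing to skip in the FIS case, here is a
sketch of a Register Host-to-Device FIS as I understand the Serial ATA
layout. The struct, field, and function names are mine, not from any
driver. The high-order LBA and count bytes are always part of the
20-byte structure, whether or not they are zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct h2d_reg_fis {
    uint8_t fis_type;            /* 0x27: Register Host-to-Device */
    uint8_t flags;               /* bit 7 set: FIS carries a command */
    uint8_t command;
    uint8_t feature_lo;
    uint8_t lba0, lba1, lba2;    /* LBA bits 0..23 */
    uint8_t device;
    uint8_t lba3, lba4, lba5;    /* LBA bits 24..47 */
    uint8_t feature_hi;
    uint8_t count_lo, count_hi;
    uint8_t icc;
    uint8_t control;
    uint8_t reserved[4];
};
static_assert(sizeof(struct h2d_reg_fis) == 20, "FIS must be 20 bytes");

/* Fill a FIS for a read/write command; every register is written. */
void fill_rw_fis(struct h2d_reg_fis *fis, uint8_t command,
                 uint64_t lba, uint16_t count)
{
    memset(fis, 0, sizeof(*fis));
    fis->fis_type = 0x27;
    fis->flags    = 0x80;
    fis->command  = command;
    fis->device   = 0x40;        /* LBA mode bit */
    fis->lba0 = (uint8_t)(lba);
    fis->lba1 = (uint8_t)(lba >> 8);
    fis->lba2 = (uint8_t)(lba >> 16);
    fis->lba3 = (uint8_t)(lba >> 24);
    fis->lba4 = (uint8_t)(lba >> 32);
    fis->lba5 = (uint8_t)(lba >> 40);
    fis->count_lo = (uint8_t)(count);
    fis->count_hi = (uint8_t)(count >> 8);
}
```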

Jeff


2004-09-01 15:37:11

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

>> I haven't examined the "released" IDE drivers in some time,
>> but one optimisation that can save a LOT of CPU usage
>> is for the driver to only use LBA48 *when necessary*,
>> and use LBA28 I/O otherwise.
>>
>> Each access to an IDE register typically chews up 600+ns,
>> or the equivalent of a couple thousand instruction executions
>> on a modern core. Avoiding LBA48 when it's not needed will
>> save four such accesses per I/O, or about 2.5us.

Yup, this is for ATA register accesses, not SATA FISs.
--
Mark Lord
(hdparm keeper & the original "Linux IDE Guy")

2004-09-01 15:40:11

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

> Doing this is either pointless or impossible on newer SATA controllers.
> Most are memory-mapped I/O not PIO, where the high-order bits of the
> ATA taskfile are accessed due to an extended register size, not
> "double-pumping" a FIFO.
>
> Even-newer SATA controllers are FIS-based rather than taskfile-based, so
> you pass it a FIS (containing all the registers) unconditionally.

PCI accesses are not free, so anything that avoids having to
go over the PCI bus is a worthwhile optimization.

The processor buses run at 200-800MHz or so, whereas PCI is normally
only clocking at 33MHz, sometimes at 66MHz.

With good ADMA or host-queuing controllers that access system
memory directly for their command blocks, then there's not much
(if any) penalty for the extra LBA48 setup. But for "normal"
controllers (if such a beast even exists), the extra writes across
the PCI bus can be costly.

Hardware write-buffer FIFOs between the CPU and the PCI bus
can reduce the impact of this somewhat, but they are often
only 2-4 entries deep, and will be filled by a normal (S)ATA
command setup sequence.

This is one of those finer points that is very difficult to measure,
since the I/O throughput is pretty much unaffected by it. But CPU
cycle count per-I/O setup is one way to measure it.

Cheers
--
Mark Lord
(hdparm keeper & the original "Linux IDE Guy")

Jeff Garzik wrote:
> Mark Lord wrote:
>
>> Bartlomiej Zolnierkiewicz wrote:
>>
>>>
>>>> What determines whether 48 bit addressing will be used then?
>>>
>>>
>>>
>>> Availability of 48-bit addressing feature set and host capabilities
>>> (some don't support LBA48 when DMA is used etc.).
>>
>>
>>
>> I haven't examined the "released" IDE drivers in some time,
>> but one optimisation that can save a LOT of CPU usage
>> is for the driver to only use LBA48 *when necessary*,
>> and use LBA28 I/O otherwise.
>>
>> Each access to an IDE register typically chews up 600+ns,
>> or the equivalent of a couple thousand instruction executions
>> on a modern core. Avoiding LBA48 when it's not needed will
>> save four such accesses per I/O, or about 2.5us.
>
>
>
> Doing this is either pointless or impossible on newer SATA controllers.
> Most are memory-mapped I/O not PIO, where the high-order bits of the
> ATA taskfile are accessed due to an extended register size, not
> "double-pumping" a FIFO.
>
> Even-newer SATA controllers are FIS-based rather than taskfile-based, so
> you pass it a FIS (containing all the registers) unconditionally.
>
> Jeff

Subject: Re: [PATCH] Configure IDE probe delays


Jens came up with this a long time ago, but the IDE driver is not ready
for it; there are some gaps to fill first.

Anyway, thanks for reminding me about this.

Cheers.

On Wednesday 01 September 2004 15:20, Mark Lord wrote:
> Bartlomiej Zolnierkiewicz wrote:
> >
> >>What determines whether 48 bit addressing will be used then?
> >
> > Availability of 48-bit addressing feature set and host capabilities
> > (some don't support LBA48 when DMA is used etc.).
>
> I haven't examined the "released" IDE drivers in some time,
> but one optimisation that can save a LOT of CPU usage
> is for the driver to only use LBA48 *when necessary*,
> and use LBA28 I/O otherwise.
>
> Each access to an IDE register typically chews up 600+ns,
> or the equivalent of a couple thousand instruction executions
> on a modern core. Avoiding LBA48 when it's not needed will
> save four such accesses per I/O, or about 2.5us.
>
> LBA48 is only needed when (1) the sector count is greater than 256,
> and/or (2) the ending sector number >= (1<<28).
>
> I regularly include this optimisation in the drivers I have been
> working on since LBA48 first appeared.
>
> Cheers

2004-09-01 16:30:25

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

On Mer, 2004-09-01 at 14:20, Mark Lord wrote:
> LBA48 is only needed when (1) the sector count is greater than 256,
> and/or (2) the ending sector number >= (1<<28).

I've played with this a bit and in the -ac IDE code it can drop back
to LBA28 for devices that are small enough not to need LBA48 when the
controller only supports PIO for LBA48 modes (eg some ALi) as 2.4-ac
did.

> I regularly include this optimisation in the drivers I have been
> working on since LBA48 first appeared.

It isn't always a win. You get cut down to 256 sectors per I/O which for
some workloads has a cost and you need to factor that into the command
issue choice as well as the last sector number being accessed.
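
A rough sketch of that trade-off, counting only command-setup register
writes. The four extra LBA48 writes are the figure quoted earlier in the
thread; the seven base writes per command are my assumption, and the
names are hypothetical. It only applies when every sector is below
(1<<28), and it ignores the per-command interrupt and completion cost,
which makes splitting look cheaper than it is.

```c
#include <assert.h>
#include <stdint.h>

#define LBA28_MAX_SECTORS  256u    /* per command */
#define LBA48_MAX_SECTORS  65536u  /* per command */
#define BASE_WRITES        7u      /* assumed taskfile writes per command */
#define LBA48_EXTRA_WRITES 4u      /* extra accesses quoted in the thread */

/* Number of commands needed to move nsect sectors. */
uint32_t commands_needed(uint32_t nsect, uint32_t max_per_cmd)
{
    assert(nsect > 0 && max_per_cmd > 0);
    return (uint32_t)(((uint64_t)nsect + max_per_cmd - 1) / max_per_cmd);
}

/* Setup writes if the request is split into LBA28 commands. */
uint32_t setup_writes_lba28(uint32_t nsect)
{
    return commands_needed(nsect, LBA28_MAX_SECTORS) * BASE_WRITES;
}

/* Setup writes if the request is issued as LBA48 commands. */
uint32_t setup_writes_lba48(uint32_t nsect)
{
    return commands_needed(nsect, LBA48_MAX_SECTORS)
           * (BASE_WRITES + LBA48_EXTRA_WRITES);
}
```

Under these assumptions a 256-sector request costs 7 writes as LBA28
against 11 as LBA48, but a 1024-sector request costs 28 writes as four
LBA28 commands against 11 as a single LBA48 command.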

Alan

2004-09-01 19:08:43

by Lee Revell

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

On Wed, 2004-09-01 at 11:06, Alan Cox wrote:
> On Mer, 2004-09-01 at 14:20, Mark Lord wrote:
> > LBA48 is only needed when (1) the sector count is greater than 256,
> > and/or (2) the ending sector number >= (1<<28).
>
> I've played with this a bit and in the -ac IDE code it can drop back
> to LBA28 for devices that are small enough not to need LBA48 when the
> controller only supports PIO for LBA48 modes (eg some ALi) as 2.4-ac
> did.
>
> > I regularly include this optimisation in the drivers I have been
> > working on since LBA48 first appeared.
>
> It isn't always a win. You get cut down to 256 sectors per I/O which for
> some workloads has a cost and you need to factor that into the command
> issue choice as well as the last sector number being accessed.
>

I have never been able to measure a decrease in disk throughput in any
disk benchmark with 256 sectors per I/O vs. 1024. This is a modestly
powered desktop with a single drive though. What kinds of workloads
would you expect to be affected by this?

Lee

2004-09-01 19:36:55

by Lee Revell

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

On Wed, 2004-09-01 at 11:36, Mark Lord wrote:
> With good ADMA or host-queuing controllers that access system
> memory directly for their command blocks, then there's not much
> (if any) penalty for the extra LBA48 setup. But for "normal"
> controllers (if such a beast even exists), the extra writes across
> the PCI bus can be costly.
>
> Hardware write-buffer FIFOs between the CPU and the PCI bus
> can reduce the impact of this somewhat, but they are often
> only 2-4 entries deep, and will be filled by a normal (S)ATA
> command setup sequence.
>
> This is one of those finer points that is very difficult to measure,
> since the I/O throughput is pretty much unaffected by it. But CPU
> cycle count per-I/O setup is one way to measure it.
>

The effect can be measured using a recent version of the voluntary
preemption patches, and disabling hardirq preemption. In this situation
the IDE I/O completion is by far the longest non-preemptible code path,
so can be easily profiled from the latency traces.

Lee

2004-09-01 19:49:39

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

On Mer, 2004-09-01 at 20:36, Lee Revell wrote:
> The effect can be measured using a recent version of the voluntary
> preemption patches, and disabling hardirq preemption. In this situation
> the IDE I/O completion is by far the longest non-preemptible code path,
> so can be easily profiled from the latency traces.

A lot of IDE controllers hold off the CPU for long times when you do I/O
cycles. The other factor is PIO which defaults to IRQ masking for safety
on old controllers. For PCI we should probably default the other way.

2004-09-02 16:06:00

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH] Configure IDE probe delays

I agree that >256 sectors will be a win, worth the
extra overhead because it will save subsequent overheads
of extra commands.

Which is why my original text (quoted by Alan) said:

On Mer, 2004-09-01 at 14:20, Mark Lord wrote:

>> LBA48 is only needed when (1) the sector count is greater than 256,
>> and/or (2) the ending sector number >= (1<<28).


Cheers
--
Mark Lord
(hdparm keeper & the original "Linux IDE Guy")