2005-05-13 19:05:42

by Benjamin LaHaise

[permalink] [raw]
Subject: [rfc/patch] libata -- port configurable delays

Hello Jeff et al,

The patch below makes the delays in ata_pause() and ata_busy_wait()
configurable on a per-port basis, and enables the no delay flag on
the one chipset I've tested on. Getting rid of the delays is worth
quite a bit: doing sequential 512 byte O_DIRECT AIO reads results in
a drop from 35.743s to 29.205s using simple-aio-min_nr 20480 10 (a copy
is available at http://www.kvack.org/~bcrl/simple-aio-min_nr.c).
Before this patch __delay() is the number one entry in oprofile
results for this workload. Does this look like a reasonable approach
for chipsets that aren't completely braindead? Cheers,

-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler

CPU: P4 / Xeon with 2 hyper-threads, speed 2994.52 MHz (estimated)
Counted INSTR_RETIRED events (retired instructions) with a unit mask of 0x03 (mu
ltiple flags) count 90000
samples % app name symbol name
6676 19.5011 vmlinux __delay
923 2.6962 vmlinux __blockdev_direct_IO
733 2.1411 vmlinux memset
711 2.0769 vmlinux kmem_cache_alloc
698 2.0389 vmlinux kmem_cache_free
639 1.8666 vmlinux mempool_alloc
455 1.3291 vmlinux as_insert_request
440 1.2853 vmlinux scsi_request_fn
404 1.1801 vmlinux ext3_get_branch
404 1.1801 vmlinux mempool_free
374 1.0925 vmlinux scsi_dispatch_cmd
356 1.0399 vmlinux __mod_timer
352 1.0282 vmlinux __make_request
...

Signed-off-by: Benjamin LaHaise <[email protected]>
diff -purN v2.6.12-rc4/drivers/scsi/ata_piix.c libata-rc4/drivers/scsi/ata_piix.c
--- v2.6.12-rc4/drivers/scsi/ata_piix.c 2005-04-28 11:01:54.000000000 -0400
+++ libata-rc4/drivers/scsi/ata_piix.c 2005-05-13 13:30:39.000000000 -0400
@@ -228,6 +228,7 @@ static struct ata_port_info piix_port_in
.sht = &piix_sht,
.host_flags = ATA_FLAG_SATA | ATA_FLAG_SRST |
PIIX_FLAG_COMBINED | PIIX_FLAG_CHECKINTR |
+ ATA_FLAG_NO_UDELAY |
ATA_FLAG_SLAVE_POSS,
.pio_mask = 0x1f, /* pio0-4 */
.mwdma_mask = 0x07, /* mwdma0-2 */
diff -purN v2.6.12-rc4/include/linux/libata.h libata-rc4/include/linux/libata.h
--- v2.6.12-rc4/include/linux/libata.h 2005-04-06 17:28:10.000000000 -0400
+++ libata-rc4/include/linux/libata.h 2005-05-13 13:32:15.000000000 -0400
@@ -113,6 +113,7 @@ enum {
ATA_FLAG_MMIO = (1 << 6), /* use MMIO, not PIO */
ATA_FLAG_SATA_RESET = (1 << 7), /* use COMRESET */
ATA_FLAG_PIO_DMA = (1 << 8), /* PIO cmds via DMA */
+ ATA_FLAG_NO_UDELAY = (1 << 9), /* don't udelay on port access */

ATA_QCFLAG_ACTIVE = (1 << 1), /* cmd not yet ack'd to scsi lyer */
ATA_QCFLAG_SG = (1 << 3), /* have s/g table? */
@@ -469,7 +470,8 @@ static inline u8 ata_chk_status(struct a
static inline void ata_pause(struct ata_port *ap)
{
ata_altstatus(ap);
- ndelay(400);
+ if (!(ap->flags & ATA_FLAG_NO_UDELAY))
+ ndelay(400);
}

static inline u8 ata_busy_wait(struct ata_port *ap, unsigned int bits,
@@ -478,7 +480,8 @@ static inline u8 ata_busy_wait(struct at
u8 status;

do {
- udelay(10);
+ if (!(ap->flags & ATA_FLAG_NO_UDELAY))
+ udelay(10);
status = ata_chk_status(ap);
max--;
} while ((status & bits) && (max > 0));


2005-05-13 19:25:17

by Jeff Garzik

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

Of course, the real solution is to get a modern controller like AHCI or
SiI 3124, which don't require banging on the Status register -- an
ultimately pointless, completely emulated operation in SATA.

:)

Jeff



2005-05-13 19:44:52

by Jeff Garzik

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

Benjamin LaHaise wrote:
> Hello Jeff et al,
>
> The patch below makes the delays in ata_pause() and ata_busy_wait()
> configurable on a per-port basis, and enables the no delay flag on
> the one chipset I've tested on. Getting rid of the delays is worth
> quite a bit: doing sequential 512 byte O_DIRECT AIO reads results in
> a drop from 35.743s to 29.205s using simple-aio-min_nr 20480 10 (a copy
> is available at http://www.kvack.org/~bcrl/simple-aio-min_nr.c).
> Before this patch __delay() is the number one entry in oprofile
> results for this workload. Does this look like a reasonable approach
> for chipsets that aren't completely braindead? Cheers,

Well, there are several things going on here.

> @@ -469,7 +470,8 @@ static inline u8 ata_chk_status(struct a
> static inline void ata_pause(struct ata_port *ap)
> {
> ata_altstatus(ap);
> - ndelay(400);
> + if (!(ap->flags & ATA_FLAG_NO_UDELAY))
> + ndelay(400);

This delay is required per spec. So, this specific change is vetoed.


> @@ -478,7 +480,8 @@ static inline u8 ata_busy_wait(struct at
> u8 status;
>
> do {
> - udelay(10);
> + if (!(ap->flags & ATA_FLAG_NO_UDELAY))
> + udelay(10);
> status = ata_chk_status(ap);
> max--;
> } while ((status & bits) && (max > 0));

This delay is based on field experience, rather than spec. I'm open to
making this optional, as you have done. Some issues related to this
delay, to consider:

1) Nothing in life is free. This delay is useful on some platforms,
because banging away at the Status register for extended periods of time
can cause an insane amount of PCI IO traffic. Removing the delay just
moves the punishment from one area to another.

2) In a few controllers, the SATA<->FIS emulation can go kerflooey if
you bang the Status register 'too hard'.

3) IIRC some rare PATA devices don't like having their Status register
banged "too hard". No data, just a vague memory.

4) It may be worthwhile to rewrite the loop to check the Status register
_first_, then delay.

Finally, simply disabling the delay is IMO _far_ too dangerous on such a
popular driver (ata_piix).

I would be conservative, and create a module option for libata (not
ata_piix) which allows a user to globally disable the delay. And make
sure that option defaults to 'delay', the current behavior.

Creative suggestions welcomed...

Jeff


2005-05-13 20:01:51

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

On Fri, May 13, 2005 at 03:13:50PM -0400, Jeff Garzik wrote:
> This delay is based on field experience, rather than spec. I'm open to
> making this optional, as you have done. Some issues related to this
> delay, to consider:

Do you have the original bug reports so we know which chipsets / drive
combinations are problematic?

> 1) Nothing in life is free. This delay is useful on some platforms,
> because banging away at the Status register for extended periods of time
> can cause an insane amount of PCI IO traffic. Removing the delay just
> moves the punishment from one area to another.

In this case the time to complete the task is reduced by 18%, so the
delay is definately excessive. That 18% is remarkably close to the
19.5% hit for __delay in the profile results.

> 2) In a few controllers, the SATA<->FIS emulation can go kerflooey if
> you bang the Status register 'too hard'.

The patch only enables it on one variant of ICH6 SATA so far.

> 3) IIRC some rare PATA devices don't like having their Status register
> banged "too hard". No data, just a vague memory.
>
> 4) It may be worthwhile to rewrite the loop to check the Status register
> _first_, then delay.
>
> Finally, simply disabling the delay is IMO _far_ too dangerous on such a
> popular driver (ata_piix).

Have you a better suggestion for blacklisting devices?

> I would be conservative, and create a module option for libata (not
> ata_piix) which allows a user to globally disable the delay. And make
> sure that option defaults to 'delay', the current behavior.

This option is undesirable for hardware which behaves correctly. The
vast majority of users aren't going to be fiddling with hdparm options
(I certainly don't want to see modern SATA being a repeat of the magic
handwaving the IDE driver and DMA went through for years).

At the very least the option should be introduced and new drivers should
start life without these legacy delays in place. If doing it on a per
port interface isn't good enough, what is? Globally seems worse to me
as then any system with a mix of old and new hardware isn't going to do
the right thing.

-ben
--
"Time is what keeps everything from happening all at once." -- John Wheeler

2005-05-13 21:23:03

by Alan

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

On Gwe, 2005-05-13 at 19:58, Benjamin LaHaise wrote:
> is available at http://www.kvack.org/~bcrl/simple-aio-min_nr.c).
> Before this patch __delay() is the number one entry in oprofile
> results for this workload. Does this look like a reasonable approach
> for chipsets that aren't completely braindead? Cheers,

If your chipset implements the 400nS lockout in hardware it certainly
seems to make sense. Nice to know someone has put it in hardware

2005-05-13 21:54:59

by Alan

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

On Gwe, 2005-05-13 at 21:03, Benjamin LaHaise wrote:
> > 3) IIRC some rare PATA devices don't like having their Status register
> > banged "too hard". No data, just a vague memory.

Not that I am aware of. There are a few ICH/PIIX variants where if you
read status during a transaction at the wrong time bad stuff occurs
including to the block on disk. That may be what you are thinking of

> >
> > 4) It may be worthwhile to rewrite the loop to check the Status register
> > _first_, then delay.

The 400nS delay after a command is required before status becomes valid.
This isn't about 'incorrect' devices in the command case. It is about
strictly correct behaviour and propogation/response times. For the cases
its not required and you wan to keep PCI load down then checking first
is clearly logical.

Also btw beware of PCI posting - writel/ndelay(400) isn't going to do
the right thing.

2005-05-13 23:08:18

by Jeff Garzik

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

Alan Cox wrote:
>>>4) It may be worthwhile to rewrite the loop to check the Status register
>>>_first_, then delay.
>
>
> The 400nS delay after a command is required before status becomes valid.
> This isn't about 'incorrect' devices in the command case. It is about
> strictly correct behaviour and propogation/response times. For the cases
> its not required and you wan to keep PCI load down then checking first
> is clearly logical.

The 400nS delay isn't the one in the loop. I was referring to the other
delay.

Putting the Status register read first will also flush out any posted
writes, before delaying, and potentially exit more rapidly if the device
is immediately ready.

Jeff


2005-05-13 23:22:20

by Jeff Garzik

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

Alan Cox wrote:
> On Gwe, 2005-05-13 at 19:58, Benjamin LaHaise wrote:
>
>>is available at http://www.kvack.org/~bcrl/simple-aio-min_nr.c).
>>Before this patch __delay() is the number one entry in oprofile
>>results for this workload. Does this look like a reasonable approach
>>for chipsets that aren't completely braindead? Cheers,
>
>
> If your chipset implements the 400nS lockout in hardware it certainly
> seems to make sense. Nice to know someone has put it in hardware

No, it's just mostly irrelevant under SATA.

Under SATA you are -not- talking to a device when you touch
Status/AltStatus, you are talking to the host controller. Specifically,
you're talking to a controller buffer that stores a copy of the ATA
shadow registers.

The ATA registers are transmitted to the device in a single packet,
called a FIS, when the Command or Device Control register is written.

When the device updates its status, or completes a command, it sends a
FIS from device to controller, instructing the controller to update its
cached copy of the Status register.

You're bitbanging a buffer, in SATA.

Jeff


2005-05-14 01:51:31

by Mark Lord

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

PIIX chips should be fine w.r.t. unnecessary delays (or lack thereof).

I recall only one situation WAY BACK in time that had an issue with
continuous banging, but don't remember anything of the details
other than that this was circa 1995 or so.

Cheers

2005-05-14 21:13:33

by Alan

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

On Sad, 2005-05-14 at 00:21, Jeff Garzik wrote:
> > If your chipset implements the 400nS lockout in hardware it certainly
> > seems to make sense. Nice to know someone has put it in hardware
>
> No, it's just mostly irrelevant under SATA.

libata is for PATA devices too.

> The ATA registers are transmitted to the device in a single packet,
> called a FIS, when the Command or Device Control register is written.
>
> When the device updates its status, or completes a command, it sends a
> FIS from device to controller, instructing the controller to update its
> cached copy of the Status register.

The controller in ATA style behaviour isn't required to have the status
register valid at the point the FIS is written. It probably does but it
isn't required.

2005-05-14 21:48:12

by Jeff Garzik

[permalink] [raw]
Subject: Re: [rfc/patch] libata -- port configurable delays

Alan Cox wrote:
> On Sad, 2005-05-14 at 00:21, Jeff Garzik wrote:
>
>>>If your chipset implements the 400nS lockout in hardware it certainly
>>>seems to make sense. Nice to know someone has put it in hardware
>>
>>No, it's just mostly irrelevant under SATA.
>
>
> libata is for PATA devices too.

Yes, but Ben has SATA, which is the hardware in question.

The 400nS delay which interests you so much in this thread isn't the
interesting delay. The 400nS delay isn't going away. The
udelay(10)-in-a-loop is what slows things down, and shows up on Ben's
profiles.

The best solution is to convert PIO from polling to interrupt driven, so
that we're not beating up the bus (big reason for the udelay).


>>The ATA registers are transmitted to the device in a single packet,
>>called a FIS, when the Command or Device Control register is written.
>>
>>When the device updates its status, or completes a command, it sends a
>>FIS from device to controller, instructing the controller to update its
>>cached copy of the Status register.
>
>
> The controller in ATA style behaviour isn't required to have the status
> register valid at the point the FIS is written. It probably does but it
> isn't required.

After a tiny period of time, it's required to have set the BSY bit.

Jeff