2004-03-27 22:37:36

by Jeff Garzik

Subject: [PATCH] speed up SATA

===== drivers/scsi/libata-scsi.c 1.18 vs edited =====
--- 1.18/drivers/scsi/libata-scsi.c Sat Mar 27 00:21:29 2004
+++ edited/drivers/scsi/libata-scsi.c Sat Mar 27 16:04:39 2004
@@ -168,6 +168,23 @@
 	sdev->use_10_for_ms = 1;
 	blk_queue_max_phys_segments(sdev->request_queue, LIBATA_MAX_PRD);
 
+	if (sdev->id < ATA_MAX_DEVICES) {
+		struct ata_port *ap;
+		struct ata_device *dev;
+
+		ap = (struct ata_port *) &sdev->host->hostdata[0];
+		dev = &ap->device[sdev->id];
+
+		if (dev->flags & ATA_DFLAG_LBA48) {
+			sdev->host->max_sectors = 65534;
+			blk_queue_max_sectors(sdev->request_queue, 65534);
+			printk(KERN_INFO "ata%u: dev %u max request 32MB (lba48)\n",
+			       ap->id, sdev->id);
+		} else
+			printk(KERN_INFO "ata%u: dev %u max request 128K\n",
+			       ap->id, sdev->id);
+	}
+
 	return 0; /* scsi layer doesn't check return value, sigh */
 }
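
As a rough illustration of where the 65534 and "32MB"/"128K" figures come
from, here is a worked-arithmetic sketch (assuming 512-byte sectors,
lba48's 16-bit sector count, and the traditional 8-bit count of older ATA;
65534 presumably stays safely below the 65,536-sector ceiling):

#include <stdio.h>

int main(void)
{
	const unsigned int sector_size = 512;	/* assumed sector size     */
	const unsigned int lba48_max   = 65534;	/* value used in the patch */
	const unsigned int legacy_max  = 256;	/* traditional ATA limit   */

	printf("lba48 max request:  %u KB (~32MB)\n",
	       lba48_max * sector_size / 1024);		/* 32767 KB */
	printf("legacy max request: %u KB\n",
	       legacy_max * sector_size / 1024);	/* 128 KB   */
	return 0;
}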


Attachments:
patch (869.00 B)

2004-03-27 23:04:48

by Stefan Smietanowski

Subject: Re: [PATCH] speed up SATA

Hi Jeff.

> The "lba48" feature in ATA allows for addressing of sectors > 137GB, and
> also allows for transfers of up to 64K sectors, instead of the
> traditional 256 sectors in older ATA.
>
> libata simply limited all transfers to 200 sectors (just under the 256
> sector limit). This was mainly being careful, and making sure I had a
> solution that worked everywhere. I also wanted to see how the iommu S/G
> stuff would shake out.
>
> Things seem to be looking pretty good, so it's now time to turn on
> lba48-sized transfers. Most SATA disks will be lba48 anyway, even the
> ones smaller than 137GB, for this and other reasons.
>
> With this simple patch, the max request size goes from 128K to 32MB...
> so you can imagine this will definitely help performance. Throughput
> goes up. Interrupts go down. Fun for the whole family.

What will happen when a PATA disk lies behind a Marvel(ous) bridge, as
in most SATA disks today?

Are large transfers mandatory in the LBA48 spec, and is LBA48 really
mandatory in SATA?

And yes, I saw that the dmesg showed a Maxtor drive, but I'm uncertain
whether that disk of yours has a Marvel chip on it or not; newer Maxtors
might already have come out without a Marvel chip, I just don't know.

// Stefan

2004-03-27 23:12:19

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Stefan Smietanowski wrote:
> What will happen when a PATA disk lies behind a Marvel(ous) bridge, as
> in most SATA disks today?

Larger transfers work fine in PATA, too.

WRT bridges, it is generally the best idea to limit to UDMA/100 (udma),
but larger transfers are OK.


> Are large transfers mandatory in the LBA48 spec, and is LBA48 really
> mandatory in SATA?

Yes and no, in that order :) SATA doesn't mandate lba48, but it is
highly unlikely that you will see a SATA disk without lba48.

Regardless, libata supports what the drive supports. Older disks still
work just fine.

Jeff



by Bartlomiej Zolnierkiewicz

Subject: Re: [PATCH] speed up SATA

On Sunday 28 of March 2004 00:04, Stefan Smietanowski wrote:
> Hi Jeff.
>
> > The "lba48" feature in ATA allows for addressing of sectors > 137GB, and
> > also allows for transfers of up to 64K sectors, instead of the
> > traditional 256 sectors in older ATA.
> >
> > libata simply limited all transfers to 200 sectors (just under the 256
> > sector limit). This was mainly being careful, and making sure I had a
> > solution that worked everywhere. I also wanted to see how the iommu S/G
> > stuff would shake out.
> >
> > Things seem to be looking pretty good, so it's now time to turn on
> > lba48-sized transfers. Most SATA disks will be lba48 anyway, even the
> > ones smaller than 137GB, for this and other reasons.
> >
> > With this simple patch, the max request size goes from 128K to 32MB...
> > so you can imagine this will definitely help performance. Throughput
> > goes up. Interrupts go down. Fun for the whole family.

What about latency?

What about the recently discussed PRD table "limit" of 256 entries?

AFAIR these are the reasons why the IDE driver currently limits the
max request size to 1024K on LBA48 disks.

> What will happen when a PATA disk lies behind a Marvel(ous) bridge, as
> in most SATA disks today?

Most modern PATA disks support LBA48, and the IDE driver has been using
large transfers for some time. :-)

> Are large transfers mandatory in the LBA48 spec, and is LBA48 really
> mandatory in SATA?

Large transfers are part of the LBA48 spec.

> And yes, I saw that the dmesg showed a Maxtor drive, but I'm uncertain
> whether that disk of yours has a Marvel chip on it or not; newer Maxtors
> might already have come out without a Marvel chip, I just don't know.

Regards,
Bartlomiej

2004-03-27 23:36:35

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Bartlomiej Zolnierkiewicz wrote:
> On Sunday 28 of March 2004 00:04, Stefan Smietanowski wrote:
>
>>Hi Jeff.
>>
>>
>>>The "lba48" feature in ATA allows for addressing of sectors > 137GB, and
>>>also allows for transfers of up to 64K sectors, instead of the
>>>traditional 256 sectors in older ATA.
>>>
>>>libata simply limited all transfers to 200 sectors (just under the 256
>>>sector limit). This was mainly being careful, and making sure I had a
>>>solution that worked everywhere. I also wanted to see how the iommu S/G
>>>stuff would shake out.
>>>
>>>Things seem to be looking pretty good, so it's now time to turn on
>>>lba48-sized transfers. Most SATA disks will be lba48 anyway, even the
>>>ones smaller than 137GB, for this and other reasons.
>>>
>>>With this simple patch, the max request size goes from 128K to 32MB...
>>>so you can imagine this will definitely help performance. Throughput
>>>goes up. Interrupts go down. Fun for the whole family.
>
>
> What about latency?
>
> What about recently discussed PRD table "limit" of 256 entries?
>
> AFAIR these are the reasons why IDE driver is currently
> limiting max request size to 1024K on LBA48 disks.

That's the main limitation on request size right now... libata limits
S/G table entries to 128[1], so a perfectly aligned, fully merged
transfer will top out at 8MB. You don't see that unless you're on a
totally quiet machine with tons of free, contiguous pages. So in
practice it winds up much smaller as the system gets more loaded (and
the pagecache gets fragmented).
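
For reference, the 8MB figure is 128 S/G entries times the (assumed) 64KB
maximum a single ATA PRD segment can describe; with a fragmented pagecache
each entry may cover only a single 4KB page. A back-of-the-envelope sketch:

#include <stdio.h>

int main(void)
{
	const unsigned int sg_entries = 128;		/* libata's S/G limit       */
	const unsigned int merged_max = 64 * 1024;	/* assumed max per segment  */
	const unsigned int single_pg  = 4 * 1024;	/* one 4KB page per segment */

	printf("fully merged: %u MB\n", sg_entries * merged_max / (1024 * 1024));
	printf("fragmented:   %u KB\n", sg_entries * single_pg / 1024);
	return 0;
}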

Latency definitely changes for the default case, but remember that a lot
of that is writeback, or streaming writes. Latency-sensitive
applications already know how to send small or no-wait I/Os, because
standard pagecache writeback latency is highly variable at best :)

Jeff



2004-03-27 23:37:56

by Nick Piggin

Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:
>
> The "lba48" feature in ATA allows for addressing of sectors > 137GB, and
> also allows for transfers of up to 64K sectors, instead of the
> traditional 256 sectors in older ATA.
>
> libata simply limited all transfers to 200 sectors (just under the 256
> sector limit). This was mainly being careful, and making sure I had a
> solution that worked everywhere. I also wanted to see how the iommu S/G
> stuff would shake out.
>
> Things seem to be looking pretty good, so it's now time to turn on
> lba48-sized transfers. Most SATA disks will be lba48 anyway, even the
> ones smaller than 137GB, for this and other reasons.
>
> With this simple patch, the max request size goes from 128K to 32MB...
> so you can imagine this will definitely help performance. Throughput
> goes up. Interrupts go down. Fun for the whole family.
>

Hi Jeff,
I think 32MB is too much. You incur latency and lose
scheduling granularity. I bet returns start diminishing
pretty quickly after 1MB or so.

2004-03-27 23:40:47

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:
> That's the main limitation on request size right now... libata limits
> S/G table entries to 128[1], so a perfectly aligned, fully merged

...

[1] because even though the block layer properly splits on segment
boundaries, pci_map_sg() may violate those boundaries (James B and
others are working on fixing this). So... for right now the driver
must check the s/g entry boundaries after DMA mapping, and split them
(again) if necessary. IDE does this in ide_build_dmatable().
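
A simplified, hypothetical sketch of that post-mapping fixup (loosely
modelled on the idea behind ide_build_dmatable(), not copied from it; the
structs are stand-ins, not kernel types):

#include <stdint.h>
#include <stdio.h>

#define PRD_BOUNDARY 0x10000u	/* an ATA PRD must not cross a 64KB boundary */

/* Simplified stand-ins; real PRDs encode a 64KB count as 0 in a 16-bit
 * field and carry an end-of-table flag. */
struct sg_ent  { uint32_t dma_addr; uint32_t len; };
struct prd_ent { uint32_t addr; uint32_t len; };

/* Walk the mapped S/G list and split any segment that crosses a 64KB
 * boundary into multiple PRD entries.  Returns the PRD count, or -1 if
 * the table would overflow (the caller must then use fewer segments). */
static int build_prd_table(const struct sg_ent *sg, int nents,
			   struct prd_ent *prd, int max_prd)
{
	int count = 0;

	for (int i = 0; i < nents; i++) {
		uint32_t addr = sg[i].dma_addr;
		uint32_t left = sg[i].len;

		while (left) {
			uint32_t to_boundary =
				PRD_BOUNDARY - (addr & (PRD_BOUNDARY - 1));
			uint32_t chunk = left < to_boundary ? left : to_boundary;

			if (count >= max_prd)
				return -1;
			prd[count].addr = addr;
			prd[count].len  = chunk;
			count++;
			addr += chunk;
			left -= chunk;
		}
	}
	return count;
}

int main(void)
{
	/* one mapped segment that straddles a 64KB boundary */
	struct sg_ent sg[] = { { 0x0000f000u, 0x3000u } };
	struct prd_ent prd[4];
	int n = build_prd_table(sg, 1, prd, 4);

	for (int i = 0; i < n; i++)
		printf("PRD %d: addr=0x%08x len=0x%x\n", i,
		       (unsigned) prd[i].addr, (unsigned) prd[i].len);
	return 0;
}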


2004-03-27 23:44:26

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Nick Piggin wrote:
> I think 32MB is too much. You incur latency and lose
> scheduling granularity. I bet returns start diminishing
> pretty quickly after 1MB or so.

See my reply to Bart.

Also, it is not the driver's responsibility to do anything but export
the hardware maximums.

It's up to the sysadmin to choose a disk scheduling policy they like,
which implies that a _scheduler_, not each individual driver, should
place policy limitations on max_sectors.

Jeff


2004-03-27 23:48:05

by Nick Piggin

Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:
> Nick Piggin wrote:
>
>> I think 32MB is too much. You incur latency and lose
>> scheduling granularity. I bet returns start diminishing
>> pretty quickly after 1MB or so.
>
>
> See my reply to Bart.
>
> Also, it is not the driver's responsibility to do anything but export
> the hardware maximums.
>
> It's up to the sysadmin to choose a disk scheduling policy they like,
> which implies that a _scheduler_, not each individual driver, should
> place policy limitations on max_sectors.
>

Yeah I suppose you're right there. In practice it doesn't
work that way though, does it?

2004-03-28 00:00:02

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Nick Piggin wrote:
> Jeff Garzik wrote:
>
>> Nick Piggin wrote:
>>
>>> I think 32MB is too much. You incur latency and lose
>>> scheduling granularity. I bet returns start diminishing
>>> pretty quickly after 1MB or so.
>>
>>
>>
>> See my reply to Bart.
>>
>> Also, it is not the driver's responsibility to do anything but export
>> the hardware maximums.
>>
>> It's up to the sysadmin to choose a disk scheduling policy they like,
>> which implies that a _scheduler_, not each individual driver, should
>> place policy limitations on max_sectors.
>>
>
> Yeah I suppose you're right there. In practice it doesn't
> work that way though, does it?

Not my problem <grin>

People shouldn't be tuning max_sectors at the source code level: that
just embeds the policy decisions in the source code, and leads to
constant fiddling with the driver to get things "just right". Over time,
disks get faster and latency falls naturally. Thus the definition of
"just right" must be constantly tuned in the driver source code as time
passes.

I also wouldn't want to lock out any users who wanted to use SATA at
full speed ;-)

Jeff



by Bartlomiej Zolnierkiewicz

Subject: Re: [PATCH] speed up SATA

On Sunday 28 of March 2004 00:40, Jeff Garzik wrote:
> Jeff Garzik wrote:
> > That's the main limitation on request size right now... libata limits
> > S/G table entries to 128[1], so a perfectly aligned, fully merged
>
> ...
>
> [1] because even though the block layer properly splits on segment
> boundaries, pci_map_sg() may violate those boundaries (James B and
> others are working on fixing this). So... for right now the driver
> must check the s/g entry boundaries after DMA mapping, and split them
> (again) if necessary. IDE does this in ide_build_dmatable().

You are right, but a small clarification is needed: the code in
ide_build_dmatable() predates segment boundary support in the block
layer (IDE never relied on it).

2004-03-28 00:06:39

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:
> It's up to the sysadmin to choose a disk scheduling policy they like,
> which implies that a _scheduler_, not each individual driver, should
> place policy limitations on max_sectors.


<tangent>

The block layer / scheduler guys should also think about a general (not
SCSI specific) way to tune TCQ tag depth. That's IMO another policy
decision.

I'm about to add a raft of SATA-2 hardware, all of which are queued.
The standard depth is 32, but one board supports a whopping depth of 256.

Given past discussion on the topic, you probably don't want to queue 256
requests at a time to hardware :) But the sysadmin should be allowed
to, if they wish...

Jeff



2004-03-28 00:09:20

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Bartlomiej Zolnierkiewicz wrote:
> On Sunday 28 of March 2004 00:40, Jeff Garzik wrote:
>
>>Jeff Garzik wrote:
>>
>>>That's the main limitation on request size right now... libata limits
>>>S/G table entries to 128[1], so a perfectly aligned, fully merged
>>
>> ...
>>
>>[1] because even though the block layer properly splits on segment
>>boundaries, pci_map_sg() may violate those boundaries (James B and
>>others are working on fixing this). So... for right now the driver
>>must check the s/g entry boundaries after DMA mapping, and split them
>>(again) if necessary. IDE does this in ide_build_dmatable().
>
>
> You are right, but a small clarification is needed: the code in
> ide_build_dmatable() predates segment boundary support in the block
> layer (IDE never relied on it).

Agreed... I'm saying it's still needed.

When the iommu layer knows about the boundaries it should respect, that
code can be removed from libata and drivers/ide, IMO... But also
double-check and make sure the IDE driver supports the worst case, by
limiting to 128 PRD entries, not 256 (a segment that crosses a 64KB
boundary splits in two, so a full 256-entry S/G list could overflow a
256-entry PRD table after splitting).

Jeff



2004-03-28 00:11:54

by Andrew Morton

Subject: Re: [PATCH] speed up SATA

Nick Piggin <[email protected]> wrote:
>
> >
> > With this simple patch, the max request size goes from 128K to 32MB...
> > so you can imagine this will definitely help performance. Throughput
> > goes up. Interrupts go down. Fun for the whole family.
> >
>
> Hi Jeff,
> I think 32MB is too much. You incur latency and lose
> scheduling granularity. I bet returns start diminishing
> pretty quickly after 1MB or so.

As far as interactivity and throughput are concerned, the effect of really
big requests will be the same as the effect of permitting _more_ requests.
Namely: more memory can be under readahead or writeback at any particular
point in time.

Note that users can cause a similar effect right now by a) increasing
nr_requests or b) installing lots of disks.

The VM/VFS is pretty good at controlling this: the dirty thresholds are
really "dirty+writeback" thresholds. We do place firm limits on the amount
of dirty+writeback memory.

When you run with a really large nr_requests you can indeed have 40% of
your machine's memory under writeback with just a single disk, and some
benchmarks do take a hit. Mainly because truncate latency increases. But
for real-life things, it just doesn't make much difference.

So I think the change will be OK.

If something bad does happen, the user can reduce nr_requests, or reduce
dirty_ratio or we can teach the VFS to clamp the amounts of dirty and
writeback memory separately rather than lumping them together for writer
throttling purposes.


Another effect of this change is that users can transiently pin larger
amounts of memory via O_DIRECT. But they can do that now, by performing
I/O to lots of disks at the same time. We'd need some form of system-wide
clamping in the direct-io code to address this. I don't know how easy such
a DoS exploit would be in practice.

2004-03-28 00:16:57

by Nick Piggin

Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:
> Jeff Garzik wrote:
>
>> It's up to the sysadmin to choose a disk scheduling policy they like,
>> which implies that a _scheduler_, not each individual driver, should
>> place policy limitations on max_sectors.
>
>
>
> <tangent>
>
> The block layer / scheduler guys should also think about a general (not
> SCSI specific) way to tune TCQ tag depth. That's IMO another policy
> decision.
>
> I'm about to add a raft of SATA-2 hardware, all of which are queued. The
> standard depth is 32, but one board supports a whopping depth of 256.
>
> Given past discussion on the topic, you probably don't want to queue 256
> requests at a time to hardware :) But the sysadmin should be allowed
> to, if they wish...

I think you could limit the number of in-flight requests quite
easily, even for drivers that do not use the block layer's
queueing functions.

Aside, you should make 2 or 4 tags the default though: that
still gives you the pipelining without sacrificing latency
and usability.

The only area (I think) where large queues outperform the IO
scheduler is lots of parallel, scattered reads. I think this
is because the drive can immediately return cached data even
though it looks like a large seek to the IO scheduler.
(This is from testing on my single, old SCSI disk though.)

2004-03-28 00:21:15

by Nick Piggin

Subject: Re: [PATCH] speed up SATA

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>> >
>> > With this simple patch, the max request size goes from 128K to 32MB...
>> > so you can imagine this will definitely help performance. Throughput
>> > goes up. Interrupts go down. Fun for the whole family.
>> >
>>
>> Hi Jeff,
>> I think 32MB is too much. You incur latency and lose
>> scheduling granularity. I bet returns start diminishing
>> pretty quickly after 1MB or so.
>
>
> As far as interactivity and throughput are concerned, the effect of really
> big requests will be the same as the effect of permitting _more_ requests.
> Namely: more memory can be under readahead or writeback at any particular
> point in time.
>

Not quite, because a single 32MB write might take what? 500ms to
complete... and the IO scheduler wants to stop writes after 60ms
if there are waiting reads.

And writes will be more likely to be submitted as large requests.

I think Jeff is right that in theory though, the drivers should
just export their capabilities, and the block layer or IO scheduler
should decide on the maximum size to actually use.

2004-03-28 00:49:32

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Nick Piggin wrote:
> Aside, you should make 2 or 4 tags the default though: that
> still gives you the pipelining without sacrificing latency
> and usability.

I would rather have a sane block layer default, that works without my
intervention :)

Ditto the max_sectors issue, the driver for the hardware w/ a 256-depth
queue should attempt to queue up to the hardware maximum. The block
layer can be smart and limit things further from there...

IIRC the AIC7xxx driver sets the queue depth to 200+ on one of my
drives. That may be 2.4, though...


> The only area (I think) where large queues outperform the IO
> scheduler is lots of parallel, scattered reads. I think this
> is because the drive can immediately return cached data even
> though it looks like a large seek to the IO scheduler.
> (This is from testing on my single, old SCSI disk though.)

Most ATA drives have their own internal disk schedulers, and all of the
major vendors seem to have done something useful with TCQ. Newer drives
definitely take advantage of the additional knowledge gained from
seeing the entire queue rather than just a single request.

The parallel-reads case you mention is definitely a winner with TCQ, but
writes are also, for ATA-specific reasons (among others): since ATA has
been one-command-at-a-time for so long, write caching became necessary
for decent performance. So ATA disks needed a decent internal IO
scheduler just to maintain a decent level of performance.

TCQ-on-write for ATA disks is yummy because you don't really know what
the heck the ATA disk is writing at the present time. By the time the
Linux disk scheduler gets around to deciding it has a nicely merged and
scheduled set of requests, it may be totally wrong for the disk's IO
scheduler. TCQ gives the disk a lot more power when the disk integrates
writes into its internal IO scheduling.

Jeff



2004-03-28 01:03:04

by Andrew Morton

Subject: Re: [PATCH] speed up SATA

Jeff Garzik <[email protected]> wrote:
>
> TCQ-on-write for ATA disks is yummy because you don't really know what
> the heck the ATA disk is writing at the present time. By the time the
> Linux disk scheduler gets around to deciding it has a nicely merged and
> scheduled set of requests, it may be totally wrong for the disk's IO
> scheduler. TCQ gives the disk a lot more power when the disk integrates
> writes into its internal IO scheduling.

Slightly beneficial for throughput, disastrous for latency.

It appears the only way we'll ever get this gross misdesign fixed is to add
a latency test to winbench.

2004-03-28 01:09:57

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Andrew Morton wrote:
> Jeff Garzik <[email protected]> wrote:
>
>> TCQ-on-write for ATA disks is yummy because you don't really know what
>> the heck the ATA disk is writing at the present time. By the time the
>> Linux disk scheduler gets around to deciding it has a nicely merged and
>> scheduled set of requests, it may be totally wrong for the disk's IO
>> scheduler. TCQ gives the disk a lot more power when the disk integrates
>> writes into its internal IO scheduling.
>
>
> Slightly beneficial for throughput, disastrous for latency.

If the disk is smart there are surely opportunities for latency
optimization as well...


> It appears the only way we'll ever get this gross misdesign fixed is to add
> a latency test to winbench.

rotfl ;-) True that...

"IOPs" are what make a lot of storage peeps excited these days, so they
are being pushed in a low-latency direction anyway.

Jeff



2004-03-28 04:40:02

by Eric D. Mudama

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 at 9:37, Nick Piggin wrote:
>I think 32MB is too much. You incur latency and lose
>scheduling granularity. I bet returns start diminishing
>pretty quickly after 1MB or so.

32-MB requests are the best for raw throughput.

~15ms to land at your target location, then pure 50-60MB/sec for the .5
seconds it takes to finish the operation. (media limited at that point)

Sure, there's more latency, but I guess that is application dependent.
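
Rough service-time arithmetic behind those numbers, assuming ~15ms to seek
and ~55MB/sec sustained media rate as above, with 1MB being the smaller
request size suggested earlier in the thread:

#include <stdio.h>

int main(void)
{
	const double seek_ms = 15.0;		/* assumed average positioning time */
	const double media_mb_s = 55.0;		/* assumed sustained media rate     */
	const double sizes_mb[] = { 1.0, 32.0 };

	for (int i = 0; i < 2; i++) {
		double xfer_ms = sizes_mb[i] / media_mb_s * 1000.0;
		printf("%4.0fMB request: ~%.0fms transfer, ~%.0fms total\n",
		       sizes_mb[i], xfer_ms, seek_ms + xfer_ms);
	}
	/* ~33ms vs ~597ms: the same streaming throughput either way, but a
	 * 32MB request ties up the drive roughly 18x longer per request. */
	return 0;
}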

--eric


--
Eric D. Mudama
[email protected]

2004-03-28 06:57:05

by Nick Piggin

Subject: Re: [PATCH] speed up SATA

Eric D. Mudama wrote:
> On Sun, Mar 28 at 9:37, Nick Piggin wrote:
>
>> I think 32MB is too much. You incur latency and lose
>> scheduling granularity. I bet returns start diminishing
>> pretty quickly after 1MB or so.
>
>
> 32-MB requests are the best for raw throughput.
>
> ~15ms to land at your target location, then pure 50-60MB/sec for the .5
> seconds it takes to finish the operation. (media limited at that point)
>
> Sure, there's more latency, but I guess that is application dependent.
>

What about a queue depth of 2, and writing that 32MB in 1MB requests?

2004-03-28 07:24:15

by Stefan Smietanowski

Subject: Re: [PATCH] speed up SATA

Hi Jeff.

>> What will happen when a PATA disk lies behind a Marvel(ous) bridge, as
>> in most SATA disks today?
>
> Larger transfers work fine in PATA, too.

Correct me if I'm wrong, but wasn't the max number of sectors in a
transfer limited to 255 instead of 256 because there were buggy
drives out there before? I seem to recall something like that, but
I could be wrong.

Of course I hope nobody in his right mind would take those old things,
slap a Marvel chip on them and sell them as new :) But someone else could
buy one of those SATA<->PATA converter cables with a Marvel on it and
run it with the old disk <shudder>. Wouldn't those cases bug out?

>> Are large transfers mandatory in the LBA48 spec, and is LBA48 really
>> mandatory in SATA?
>
> Yes and no, in that order :) SATA doesn't mandate lba48, but it is
> highly unlikely that you will see a SATA disk without lba48.

Naturally, see my comment above regarding SATA<->PATA converter cables.

> Regardless, libata supports what the drive supports. Older disks still
> work just fine.

Or.. should :)

// Stefan

2004-03-28 07:33:21

by Stefan Smietanowski

Subject: Re: [PATCH] speed up SATA

Hi Jeff.

> I'm about to add a raft of SATA-2 hardware, all of which are queued. The
> standard depth is 32, but one board supports a whopping depth of 256.

Speaking of which... I just read an announcement that someone (of course
the name eludes me) has announced a DVD burner that's SATA.

Found it:

http://www.plextor.com/english/news/press/712SA_pr.htm

a) Are there provisions in the SATA (1) SPEC for support of
non-disk units?

b) if (strcmp(a, "no"))
Do you know anything about it, ie is it SATA1 or 2 or what?

c) Let's ponder one gets a unit like this - is it usable with
libata yet?

d) if (strcmp(c, "no"))
Will it? :)

// Stefan

2004-03-28 13:51:35

by Jamie Lokier

Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:
> TCQ-on-write for ATA disks is yummy because you don't really know what
> the heck the ATA disk is writing at the present time. By the time the
> Linux disk scheduler gets around to deciding it has a nicely merged and
> scheduled set of requests, it may be totally wrong for the disk's IO
> scheduler. TCQ gives the disk a lot more power when the disk integrates
> writes into its internal IO scheduling.

Does TCQ-on-write allow you to do ordered write commits, as with barriers,
but without needing full cache flushes, and still get good performance?

In principle I think the answer is yes, but what is the answer with
real disks in practice?

-- Jamie

2004-03-28 13:59:18

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sat, Mar 27 2004, Jeff Garzik wrote:
> "IOPs" are what make a lot of storage peeps excited these days, so they
> are being pushed in a low-latency direction anyway.

IOPS says nothing about latency, it's still 100% throughput oriented.

--
Jens Axboe

2004-03-28 14:08:18

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sat, Mar 27 2004, Jeff Garzik wrote:
> Nick Piggin wrote:
> >I think 32MB is too much. You incur latency and lose
> >scheduling granularity. I bet returns start diminishing
> >pretty quickly after 1MB or so.
>
> See my reply to Bart.
>
> Also, it is not the driver's responsibility to do anything but export
> the hardware maximums.

The problem is that what the 'reasonable size' is depends highly on the
hardware. The 32MB doesn't make any sense to me for SATA (or for
anything else, for that matter). Back-to-back 1MB requests (this is the
default I chose for PATA) should be practically identical for throughput,
and loads better for optimizing latencies. You cannot do much with 32MB
requests wrt latency...

So you could change ->max_sectors to be 'max allowable io, hardware
limitation' and add ->optimal_sectors to be 'best sized io'. I don't see
that it buys you anything, since you'd be going for optimal_sectors all
the time anyways.

Additionally, a single bio cannot currently be bigger than 256 pages
(ie 1MB request on 4k page). This _could_ be increased of course, but
not beyond making ->bio_io_vec be bigger than a page. It's already
bigger than that on x86 in fact, if you use 64-bit dma_addr_t. For
32-bit dma_addr_t 256 entries fit a page perfectly. Merging can get you
bigger requests of course.
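
Spelled out (assuming 4KB pages and the 256-vector-per-bio limit above),
the arithmetic is:

#include <stdio.h>

int main(void)
{
	const unsigned int page_size = 4096;	/* assumed 4KB pages    */
	const unsigned int max_vecs  = 256;	/* per-bio vector limit */

	/* 256 * 4KB = 1024KB = 1MB per bio before merging */
	printf("max single-bio payload: %u KB\n", max_vecs * page_size / 1024);
	return 0;
}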

In summary, there needs to be some extremely good numbers and arguments
for changing any of this, so far I don't see anything except people
drooling over sending 32MB requests.

> It's up to the sysadmin to choose a disk scheduling policy they like,
> which implies that a _scheduler_, not each individual driver, should
> place policy limitations on max_sectors.

Not just scheduler, lower layers in general.

--
Jens Axboe

2004-03-28 14:10:23

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sat, Mar 27 2004, Jeff Garzik wrote:
> I also wouldn't want to lock out any users who wanted to use SATA at
> full speed ;-)

And full speed requires 32MB requests?

--
Jens Axboe

by Bartlomiej Zolnierkiewicz

Subject: Re: [PATCH] speed up SATA

On Sunday 28 of March 2004 09:23, Stefan Smietanowski wrote:
> Hi Jeff.
>
> >> What will happen when a PATA disk lies behind a Marvel(ous) bridge, as
> >> in most SATA disks today?
> >
> > Larger transfers work fine in PATA, too.
>
> Correct me if I'm wrong but wasn't the max number of sectors in a
> transfer limited to 255 instead of 256 because there were buggy
> drives out there before ? I seem to recall something like that but
> I could be wrong.

There was a 255-sector limit in the IDE driver, but it was based on only
_one_ bug report on an _unreliable_ drive (256 sectors causes heavier I/O
load).

AFAIR "the other OS" is using 256 sectors, so this value is safe.

Regards,
Bartlomiej

2004-03-28 17:24:26

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jamie Lokier wrote:
> Jeff Garzik wrote:
>
>>TCQ-on-write for ATA disks is yummy because you don't really know what
>>the heck the ATA disk is writing at the present time. By the time the
>>Linux disk scheduler gets around to deciding it has a nicely merged and
>>scheduled set of requests, it may be totally wrong for the disk's IO
>>scheduler. TCQ gives the disk a lot more power when the disk integrates
>>writes into its internal IO scheduling.
>
>
> Does TCQ-on-write allow you to do ordered write commits, as with barriers,
> but without needing full cache flushes, and still get good performance?

Nope, TCQ is just a bunch of commands rather than one. There are no
special barrier indicators you can pass down with a command.

Jeff




2004-03-28 17:29:27

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> On Sat, Mar 27 2004, Jeff Garzik wrote:
>
>>"IOPs" are what make a lot of storage peeps excited these days, so they
>>are being pushed in a low-latency direction anyway.
>
>
> IOPS says nothing about latency, it's still 100% throughput oriented.


In order to achieve a large IOPs number, one must send a ton of small
requests. Per-request latency as well as overall throughput is a factor.

Jeff



2004-03-28 17:32:02

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jeff Garzik wrote:
> Jens Axboe wrote:
> >On Sat, Mar 27 2004, Jeff Garzik wrote:
> >
> >>"IOPs" are what make a lot of storage peeps excited these days, so they
> >>are being pushed in a low-latency direction anyway.
> >
> >
> >IOPS says nothing about latency, it's still 100% throughput oriented.
>
>
> In order to achieve a large IOPs number, one must send a ton of small
> requests. Per-request latency as well as overall throughput is a factor.

I don't see how per-request latency is a factor at all, if your metric
is simply iops/sec. You could infinitely starve lots of requests
without hurting your io rate or throughput.

--
Jens Axboe

2004-03-28 17:32:03

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> On Sat, Mar 27 2004, Jeff Garzik wrote:
>
>>I also wouldn't want to lock out any users who wanted to use SATA at
>>full speed ;-)
>
>
> And full speed requires 32MB requests?


Full speed is the SATA driver supporting the hardware maximum. The
block layer and general fragmentation limit things further from there.

Jeff



2004-03-28 17:37:52

by Jamie Lokier

Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:
> >Does TCQ-on-write allow you to do ordered write commits, as with barriers,
> >but without needing full cache flushes, and still get good performance?
>
> Nope, TCQ is just a bunch of commands rather than one. There are no
> special barrier indicators you can pass down with a command.

I meant without barrier indicators (although that would've been nice).

This is what I mean: turn off write caching, and performance on PATA
drops because of the serialisation and lost inter-command time.

With TCQ-on-write, you can turn off write caching and in theory
performance doesn't have to drop, is that right?

In addition, you can implement ordered writes by waiting until
"before" writes in a partial order are completed prior to sending
"after" writes to the drive. Meanwhile, because of the TCQ, other
read and write transactions can continue to take place, even if the
disk takes a long time to commit those writes that you're waiting on.

I'm wondering if that sort of strategy gives good performance with
TCQ-on-write drives, so that full-cache-flush barrier commands aren't
useful.
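
A user-space analogue of that ordering-by-completion idea, sketched with
POSIX AIO (purely illustrative: the file name and the two writes are made
up, and without O_SYNC/O_DIRECT a completed aio_write only means the page
cache accepted the data, not that it reached the platter):

/* Submit a "before" write, wait for it to complete, and only then submit
 * the dependent "after" write; no global cache flush involved, and
 * unrelated I/O could stay in flight meanwhile.  Link with -lrt on older
 * glibc. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void submit_write(struct aiocb *cb, int fd, char *buf, size_t len,
			 off_t off)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_fildes = fd;
	cb->aio_buf    = buf;
	cb->aio_nbytes = len;
	cb->aio_offset = off;
	if (aio_write(cb) != 0) {
		perror("aio_write");
		exit(1);
	}
}

static void wait_for(struct aiocb *cb)
{
	const struct aiocb *list[1] = { cb };

	while (aio_error(cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);	/* block until completion */
	if (aio_return(cb) < 0) {
		perror("aio write failed");
		exit(1);
	}
}

int main(void)
{
	static char before_blk[512] = "journal block";
	static char after_blk[512]  = "commit block";
	struct aiocb before, after;
	int fd = open("ordered-demo.dat", O_WRONLY | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	submit_write(&before, fd, before_blk, sizeof(before_blk), 0);
	wait_for(&before);			/* the ordering point */
	submit_write(&after, fd, after_blk, sizeof(after_blk), 4096);
	wait_for(&after);
	close(fd);
	return 0;
}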

-- Jamie

2004-03-28 17:38:06

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jeff Garzik wrote:
> Jens Axboe wrote:
> >On Sat, Mar 27 2004, Jeff Garzik wrote:
> >
> >>I also wouldn't want to lock out any users who wanted to use SATA at
> >>full speed ;-)
> >
> >
> >And full speed requires 32MB requests?
>
>
> Full speed is the SATA driver supporting the hardware maximum. The

Come on Jeff, don't be such a slave to the hardware specifications. Just
because it's possible to send down 32MB requests doesn't necessarily
mean it's a super thing to do, nor that it automagically makes 'things
go faster'. The claim is that back-to-back 1MB requests are every bit as
fast as a 32MB request (especially if you have a small queue depth, in
that case there truly should be zero benefit to doing the bigger ones).
The cut-off point is likely even lower than 1MB, I'm just using that
figure as a value that is 'pretty big' yet doesn't incur too large
latencies just because of its size.

--
Jens Axboe

2004-03-28 17:41:28

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jeff Garzik wrote:
> Jamie Lokier wrote:
> >Jeff Garzik wrote:
> >
> >>TCQ-on-write for ATA disks is yummy because you don't really know what
> >>the heck the ATA disk is writing at the present time. By the time the
> >>Linux disk scheduler gets around to deciding it has a nicely merged and
> >>scheduled set of requests, it may be totally wrong for the disk's IO
> >>scheduler. TCQ gives the disk a lot more power when the disk integrates
> >>writes into its internal IO scheduling.
> >
> >
> >Does TCQ-on-write allow you to do ordered write commits, as with barriers,
> >but without needing full cache flushes, and still get good performance?
>
> Nope, TCQ is just a bunch of commands rather than one. There are no
> special barrier indicators you can pass down with a command.

What would be nice (and I seem to recall that Andre also pushed for
this) would be the FUA bit doubling as an ordered tag indicator when
using TCQ. It's one of those things that keep ATA squarely outside of
the big machine uses. That other OS had a differing opinion of what to
do with that, so...

--
Jens Axboe

2004-03-28 17:41:56

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> On Sat, Mar 27 2004, Jeff Garzik wrote:
>
>>Nick Piggin wrote:
>>
>>>I think 32MB is too much. You incur latency and lose
>>>scheduling granularity. I bet returns start diminishing
>>>pretty quickly after 1MB or so.
>>
>>See my reply to Bart.
>>
>>Also, it is not the driver's responsibility to do anything but export
>>the hardware maximums.
>
>
> The problem is that what the 'reasonable size' is depends highly on the
> hardware. The 32MB doesn't make any sense to me for SATA (or for
> anything else, for that matter). Back-to-back 1MB requests (this is the
> default I chose for PATA) should be practically identical for throughput,
> and loads better for optimizing latencies. You cannot do much with 32MB
> requests wrt latency...
>
> So you could change ->max_sectors to be 'max allowable io, hardware
> limitation' and add ->optimal_sectors to be 'best sized io'. I don't see
> that it buys you anything, since you'd be going for optimal_sectors all
> the time anyways.
>
> Additionally, a single bio cannot currently be bigger than 256 pages
> (ie 1MB request on 4k page). This _could_ be increased of course, but
> not beyond making ->bio_io_vec be bigger than a page. It's already
> bigger than that on x86 in fact, if you use 64-bit dma_addr_t. For
> 32-bit dma_addr_t 256 entries fit a page perfectly. Merging can get you
> bigger requests of course.
>
> In summary, there needs to be some extremely good numbers and arguments
> for changing any of this, so far I don't see anything except people
> drooling over sending 32MB requests.

I think you're way too stuck on the 32MB number. Other limitations
already limit that further.

If and when the upper layers pass down such requests, my driver can
handle that. For today, various external limitations and factors --
such as what you describe above -- mean the driver won't be getting
anywhere near 32MB requests. And my driver can handle that just fine too.

Jeff




2004-03-28 17:46:01

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jeff Garzik wrote:
> Jens Axboe wrote:
> >On Sat, Mar 27 2004, Jeff Garzik wrote:
> >
> >>Nick Piggin wrote:
> >>
> >>>I think 32MB is too much. You incur latency and lose
> >>>scheduling granularity. I bet returns start diminishing
> >>>pretty quickly after 1MB or so.
> >>
> >>See my reply to Bart.
> >>
> >>Also, it is not the driver's responsibility to do anything but export
> >>the hardware maximums.
> >
> >
> >The problem is that what the 'reasonable size' is depends highly on the
> >hardware. The 32MB doesn't make any sense to me for SATA (or for
> >anything else, for that matter). Back-to-back 1MB requests (this is the
> >default I chose for PATA) should be practically identical for throughput,
> >and loads better for optimizing latencies. You cannot do much with 32MB
> >requests wrt latency...
> >
> >So you could change ->max_sectors to be 'max allowable io, hardware
> >limitation' and add ->optimal_sectors to be 'best sized io'. I don't see
> >that it buys you anything, since you'd be going for optimal_sectors all
> >the time anyways.
> >
> >Additionally, a single bio cannot currently be bigger than 256 pages
> >(ie 1MB request on 4k page). This _could_ be increased of course, but
> >not beyond making ->bio_io_vec be bigger than a page. It's already
> >bigger than that on x86 in fact, if you use 64-bit dma_addr_t. For
> >32-bit dma_addr_t 256 entries fit a page perfectly. Merging can get you
> >bigger requests of course.
> >
> >In summary, there needs to be some extremely good numbers and arguments
> >for changing any of this, so far I don't see anything except people
> >drooling over sending 32MB requests.
>
> I think you're way too stuck on the 32MB number. Other limitations
> already limit that further.

Of course I am, this is what you are proposing as the 'optimal io size'
for SATA. I already explained that ->max_sectors is more of a 'best io
size' than a 'hardware limit' setting. So if you set that to 32MB, then
that is what you are indicating.

> If and when the upper layers pass down such requests, my driver can
> handle that. For today, various external limitations and factors --
> such as what you describe above -- mean the driver won't be getting
> anywhere near 32MB requests. And my driver can handle that just fine
> too.

In that light, I don't think your change makes any sense whatsoever. You
have a different understanding of what the max_sectors member does;
that's basically the heart of the discussion.

I don't think it would be too hard to send down a 32MB request if you
just bypass the file system, in fact I would have thought you would have
benched this long before proposing such a modification.

--
Jens Axboe

2004-03-28 17:48:39

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> On Sun, Mar 28 2004, Jeff Garzik wrote:
>
>>Jens Axboe wrote:
>>
>>>On Sat, Mar 27 2004, Jeff Garzik wrote:
>>>
>>>
>>>>I also wouldn't want to lock out any users who wanted to use SATA at
>>>>full speed ;-)
>>>
>>>
>>>And full speed requires 32MB requests?
>>
>>
>>Full speed is the SATA driver supporting the hardware maximum. The
>
>
> Come on Jeff, don't be such a slave to the hardware specifications. Just
> because it's possible to send down 32MB requests doesn't necessarily
> mean it's a super thing to do, nor that it automagically makes 'things
> go faster'. The claim is that back-to-back 1MB requests are every bit as
> fast as a 32MB request (especially if you have a small queue depth, in
> that case there truly should be zero benefit to doing the bigger ones).
> The cut-off point is likely even lower than 1MB, I'm just using that
> figure as a value that is 'pretty big' yet doesn't incur too large
> latencies just because of its size.


For me this is a policy issue.

I agree that huge requests hurt latency. I just disagree that the
_driver_ should artificially lower its maximums to fit a guess about
what the best request size should be.

If there needs to be an overall limit on per-request size, do it at the
block layer. It's not scalable to hardcode that limit into every
driver. That's not the driver's job. The driver just exports the
hardware limits, nothing more.

A limit is fine. I support that. An artificial limit in the driver is not.

Jeff



2004-03-28 17:50:40

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> What would be nice (and I seem to recall that Andre also pushed for
> this) would be the FUA bit doubling as an ordered tag indicator when
> using TCQ. It's one of those things that keep ATA squarely outside of
> the big machine uses. That other OS had a differing opinion of what to
> do with that, so...

Preach on, brother Jens :)

I agree completely. Or, the ATA guys could use SCSI's ordered tags /
linked commands.

Regardless, there's ATA dain bramage that needs fixing... Sigh.

Jeff



2004-03-28 17:54:51

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jeff Garzik wrote:
> Jens Axboe wrote:
> >On Sun, Mar 28 2004, Jeff Garzik wrote:
> >
> >>Jens Axboe wrote:
> >>
> >>>On Sat, Mar 27 2004, Jeff Garzik wrote:
> >>>
> >>>
> >>>>I also wouldn't want to lock out any users who wanted to use SATA at
> >>>>full speed ;-)
> >>>
> >>>
> >>>And full speed requires 32MB requests?
> >>
> >>
> >>Full speed is the SATA driver supporting the hardware maximum. The
> >
> >
> >Come on Jeff, don't be such a slave to the hardware specifications. Just
> >because it's possible to send down 32MB requests doesn't necessarily
> >mean it's a super thing to do, nor that it automagically makes 'things
> >go faster'. The claim is that back-to-back 1MB requests are every bit as
> >fast as a 32MB request (especially if you have a small queue depth, in
> >that case there truly should be zero benefit to doing the bigger ones).
> >The cut-off point is likely even lower than 1MB, I'm just using that
> >figure as a value that is 'pretty big' yet doesn't incur too large
> >latencies just because of its size.
>
>
> For me this is a policy issue.
>
> I agree that huge requests hurt latency. I just disagree that the
> _driver_ should artificially lower its maximums to fit a guess about
> what the best request size should be.
>
> If there needs to be an overall limit on per-request size, do it at the
> block layer. It's not scalable to hardcode that limit into every
> driver. That's not the driver's job. The driver just exports the
> hardware limits, nothing more.
>
> A limit is fine. I support that. An artificial limit in the driver
> is not.

Sorry, but I cannot disagree more. You think an artificial limit at the
block layer is better than one imposed at the driver end, which actually
has a lot more of an understanding of what hardware it is driving? This
makes zero sense to me. Take floppy.c for instance, I really don't want
1MB requests there, since that would take a minute to complete. And I
might not want 1MB requests on my Super-ZXY storage, because that beast
completes io easily at an iorate of 200MB/sec.

So you want to put this _policy_ in the block layer, instead of in the
driver. That's an even worse decision if your reasoning is policy. The
only such limits I would want to put in are those of the bio, where it
is simply best to keep it small and contained within a single page to
avoid higher order allocations to do io. Limits based on general sound
principles, not something that caters to some particular piece of
hardware. I absolutely refuse to put a global block layer 'optimal io
size' restriction in, since that is the ugliest of policies and without
having _any_ knowledge of what the hardware can do.

--
Jens Axboe

2004-03-28 17:54:38

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jamie Lokier wrote:
> This is what I mean: turn off write caching, and performance on PATA
> drops because of the serialisation and lost inter-command time.
>
> With TCQ-on-write, you can turn off write caching and in theory
> performance doesn't have to drop, is that right?

hmmm, that's a good point. I honestly don't know. Something to be
tested, I suppose...

My premature guess would be that you would need a queue depth larger
than 2 or 4 before performance starts to not-suck.

Jeff



2004-03-28 17:56:53

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jeff Garzik wrote:
> Jens Axboe wrote:
> >What would be nice (and I seem to recall that Andre also pushed for
> >this) would be the FUA bit doubling as an ordered tag indicator when
> >using TCQ. It's one of those things that keep ATA squarely outside of
> >the big machine uses. That other OS had a differing opinion of what to
> >do with that, so...
>
> Preach on, brother Jens :)

I think we already lost this one, I'm afraid :-)

> I agree completely. Or, the ATA guys could use SCSI's ordered tags /
> linked commands.
>
> Regardless, there's ATA dain bramage that needs fixing... Sigh.

Indeed, and it really hurt that they passed up this opportunity last
time, ATA TCQ would have kicked so much more ass... Maybe Eric can beat
some sense into his colleagues.

--
Jens Axboe

2004-03-28 18:04:42

by Jeff Garzik

Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> On Sun, Mar 28 2004, Jeff Garzik wrote:
>
>>Jens Axboe wrote:
>>
>>>What would be nice (and I seem to recall that Andre also pushed for
>>>this) would be the FUA bit doubling as an ordered tag indicator when
>>>using TCQ. It's one of those things that keep ATA squarely outside of
>>>the big machine uses. That other OS had a differing opinion of what to
>>>do with that, so...
>>
>>Preach on, brother Jens :)
>
>
> I think we already lost this one, I'm afraid :-)
>
>
>>I agree completely. Or, the ATA guys could use SCSI's ordered tags /
>>linked commands.
>>
>>Regardless, there's ATA dain bramage that needs fixing... Sigh.
>
>
> Indeed, and it really hurt that they passed up this opportunity last
> time, ATA TCQ would have kicked so much more ass... Maybe Eric can beat
> some sense into his colleagues.


I bet if we can come up with a decent proposal, with technical rationale
for the change... that could be presented to the right ATA people :)
It's worth a shot.

Jeff



2004-03-28 18:08:28

by Jamie Lokier

Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> Sorry, but I cannot disagree more. You think an artificial limit at
> the block layer is better than one imposed at the driver end, which
> actually has a lot more of an understanding of what hardware it is
> driving? This makes zero sense to me. Take floppy.c for instance, I
> really don't want 1MB requests there, since that would take a minute
> to complete. And I might not want 1MB requests on my Super-ZXY
> storage, because that beast completes io easily at an iorate of
> 200MB/sec.

The driver doesn't know how fast the drive is either.

Without timing the drive and interface for different request sizes,
neither the block layer _nor_ the driver knows a suitable size.

> I absolutely refuse to put a global block layer 'optimal io
> size' restriction in, since that is the ugliest of policies and
> without having _any_ knowledge of what the hardware can do.

But the driver doesn't have _any_ knowledge of what the I/O scheduler
wants. 1MB requests may be a cut-off above which there is negligible
throughput gain for SATA, but those requests may be _far_ too large
for a low-latency I/O scheduling requirement.

If we have a high-level latency scheduling constraint that userspace
should be able to issue a read and get the result within 50ms, or that
the average latency for reads should be <500ms, how does the SATA
driver limiting requests to 1MB help? It depends on the attached drive.

The fundamental problem here is that neither the driver nor the block
layer have all the information needed to select optimal or maximum
request sizes. That can only be found by timing the device, perhaps
every time a request is made, and adjusting the I/O scheduling and
request splitting parameters according to that timing and high-level
latency requirements.

From that point of view, the generic block layer is exactly the right
place to determine those parameters, because the calculation is not
device-specific.
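
A toy sketch of that timing-feedback idea (entirely hypothetical; the
thresholds and numbers are invented): measure each request's service time
and grow or shrink an effective size cap against a latency target, while
the driver still only reports its true hardware maximum:

#include <stdio.h>

#define MIN_SECTORS	8	/* 4KB floor                          */
#define HW_MAX_SECTORS	65534	/* what the driver says it can handle */

static unsigned int cap_sectors = 2048;		/* current cap: 1MB */

/* Called on request completion with the measured service time. */
static void update_cap(unsigned int sectors_done, unsigned int service_ms,
		       unsigned int target_ms)
{
	if (service_ms > target_ms && cap_sectors > MIN_SECTORS)
		cap_sectors /= 2;		/* too slow: shrink the cap */
	else if (service_ms < target_ms / 2 &&
		 sectors_done >= cap_sectors &&
		 cap_sectors < HW_MAX_SECTORS)
		cap_sectors *= 2;		/* plenty of headroom: grow */
	if (cap_sectors > HW_MAX_SECTORS)
		cap_sectors = HW_MAX_SECTORS;
}

int main(void)
{
	/* simulated completions: {sectors, measured ms}, 50ms target */
	unsigned int samples[4][2] = {
		{ 2048, 25 }, { 2048, 20 }, { 4096, 80 }, { 2048, 30 }
	};

	for (int i = 0; i < 4; i++) {
		update_cap(samples[i][0], samples[i][1], 50);
		printf("%u sectors in %ums -> cap now %u sectors\n",
		       samples[i][0], samples[i][1], cap_sectors);
	}
	return 0;
}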

-- Jamie

2004-03-28 18:09:30

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jeff Garzik wrote:
> >>I agree completely. Or, the ATA guys could use SCSI's ordered tags /
> >>linked commands.
> >>
> >>Regardless, there's ATA dain bramage that needs fixing... Sigh.
> >
> >
> >Indeed, and it really hurt that they passed up this opportunity last
> >time, ATA TCQ would have kicked so much more ass... Maybe Eric can beat
> >some sense into his colleagues.
>
>
> I bet if we can come up with a decent proposal, with technical rationale
> for the change... that could be presented to the right ATA people :)
> It's worth a shot.

Count me in for that. The ATA people seem to have a preference for
adding a new set of commands for this type of thing, whereas I
(originally, I did actually send in a proposal like this on the ml)
wanted to just use one of the reserved bits in the tag field.

--
Jens Axboe

2004-03-28 18:14:38

by William Lee Irwin III

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28, 2004 at 07:54:36PM +0200, Jens Axboe wrote:
> Sorry, but I cannot disagree more. You think an artificial limit at the
> block layer is better than one imposed at the driver end, which actually
> has a lot more of an understanding of what hardware it is driving? This
> makes zero sense to me. Take floppy.c for instance, I really don't want
> 1MB requests there, since that would take a minute to complete. And I
> might not want 1MB requests on my Super-ZXY storage, because that beast
> completes io easily at an iorate of 200MB/sec.
> So you want to put this _policy_ in the block layer, instead of in the
> driver. That's an even worse decision if your reasoning is policy. The
> only such limits I would want to put in are those of the bio, where it
> is simply best to keep it small and contained within a single page to
> avoid higher order allocations to do io. Limits based on general sound
> principles, not something that caters to some particular piece of
> hardware. I absolutely refuse to put a global block layer 'optimal io
> size' restriction in, since that is the ugliest of policies and without
> having _any_ knowledge of what the hardware can do.

How about per-device policies and driver hints wrt. optimal io?


-- wli

2004-03-28 18:16:42

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jamie Lokier wrote:
> Jens Axboe wrote:
> > Sorry, but I cannot disagree more. You think an artificial limit at
> > the block layer is better than one imposed at the driver end, which
> > actually has a lot more of an understanding of what hardware it is
> > driving? This makes zero sense to me. Take floppy.c for instance, I
> > really don't want 1MB requests there, since that would take a minute
> > to complete. And I might not want 1MB requests on my Super-ZXY
> > storage, because that beast completes io easily at an iorate of
> > 200MB/sec.
>
> The driver doesn't know how fast the drive is either.
>
> Without timing the drive and interface for different request sizes,
> neither the block layer _nor_ the driver knows a suitable size.

The driver may not know exactly, but it does know a ballpark figure.
You know if you are driving a floppy (sucky transfer and latency), a hard
drive, a cdrom (decent transfer, sucky seeks), etc.

> > I absolutely refuse to put a global block layer 'optimal io
> > size' restriction in, since that is the ugliest of policies and
> > without having _any_ knowledge of what the hardware can do.
>
> But the driver doesn't have _any_ knowledge of what the I/O scheduler
> wants. 1MB requests may be a cut-off above which there is negligible

It's not what the io scheduler wants, it's what you can provide at a
reasonable latency. You cannot preempt that unit of io.

> throughput gain for SATA, but those requests may be _far_ too large
> for a low-latency I/O scheduling requirement.
>
> If we have a high-level latency scheduling constraint that userspace
> should be able to issue a read and get the result within 50ms, or that
> the average latency for reads should be <500ms, how does the SATA
> driver limiting requests to 1MB help? It depends on the attached drive.

Yep it sure does, but try and find a drive attached to a SATA controller
that cannot do 40MiB/sec (or something like that). Storage doesn't move
_that_ fast, you can keep up.

> The fundamental problem here is that neither the driver nor the block
> layer have all the information needed to select optimal or maximum
> request sizes. That can only be found by timing the device, perhaps
> every time a request is made, and adjusting the I/O scheduling and
> request splitting parameters according to that timing and high-level
> latency requirements.

I agree with that, completely. And I still maintain that putting the
restriction blindly into the hands of the block layer is not a good
idea. The driver may not know completely what storage is attached to it,
but it can peek and poke and get a general idea. As it stands right now,
the block layer has _zero_ knowledge. Unless you start adding timing and
imposing max request size based on the latencies. If you do that, then I
would be quite happy with changing ->max_sectors to be the hardware
limit.

> From that point of view, the generic block layer is exactly the right
> place to determine those parameters, because the calculation is not
> device-specific.

If you start adding that type of code, sure. That's a different discussion
from this one, though, and it would raise a new set of problems (the AS io
scheduler already does some of this privately).

--
Jens Axboe

2004-03-28 18:17:41

by Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, William Lee Irwin III wrote:
> On Sun, Mar 28, 2004 at 07:54:36PM +0200, Jens Axboe wrote:
> > Sorry, but I cannot disagree more. You think an artificial limit at the
> > block layer is better than one imposed at the driver end, which actually
> > has a lot more of an understanding of what hardware it is driving? This
> > makes zero sense to me. Take floppy.c for instance, I really don't want
> > 1MB requests there, since that would take a minute to complete. And I
> > might not want 1MB requests on my Super-ZXY storage, because that beast
> > completes io easily at an iorate of 200MB/sec.
> > So you want to put this _policy_ in the block layer, instead of in the
> > driver. That's an even worse decision if your reasoning is policy. The
> > only such limits I would want to put in are those of the bio, where it
> > is simply best to keep it small and contained within a single page to
> > avoid higher order allocations to do io. Limits based on general sound
> > principles, not something that caters to some particular piece of
> > hardware. I absolutely refuse to put a global block layer 'optimal io
> > size' restriction in, since that is the ugliest of policies and without
> > having _any_ knowledge of what the hardware can do.
>
> How about per-device policies and driver hints wrt. optimal io?

That would be fine; it's what I suggested in an earlier email. In the
future, Jamie's suggestion is probably the one that makes the most sense
- just keep a per-driver limit setting which informs the block layer of
the max sectors the hardware truly can do, and try to time request
execution if you care about latencies.

But let's not forget the original question, which is when and if 32MB
requests make sense at all. Right now they probably don't.

--
Jens Axboe

by Bartlomiej Zolnierkiewicz

Subject: Re: [PATCH] speed up SATA

On Sunday 28 of March 2004 20:12, William Lee Irwin III wrote:
> On Sun, Mar 28, 2004 at 07:54:36PM +0200, Jens Axboe wrote:
> > Sorry, but I cannot disagree more. You think an artificial limit at the
> > block layer is better than one imposed at the driver end, which actually
> > has a lot more of an understanding of what hardware it is driving? This
> > makes zero sense to me. Take floppy.c for instance, I really don't want
> > 1MB requests there, since that would take a minute to complete. And I
> > might not want 1MB requests on my Super-ZXY storage, because that beast
> > completes io easily at an iorate of 200MB/sec.
> > So you want to put this _policy_ in the block layer, instead of in the
> > driver. That's an even worse decision if your reasoning is policy. The
> > only such limits I would want to put in, are those of the bio where
> > simply is best to keep that small and contained within a single page to
> > avoid higher order allocations to do io. Limits based on general sound
> > principles, not something that caters to some particular piece of
> > hardware. I absolutely refuse to put a global block layer 'optimal io
> > size' restriction in, since that is the ugliest of policies and without
> > having _any_ knowledge of what the hardware can do.
>
> How about per-device policies and driver hints wrt. optimal io?

Yep, user-tunable per-device policies with sane driver defaults.

2004-03-28 18:31:44

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Bartlomiej Zolnierkiewicz wrote:
> On Sunday 28 of March 2004 20:12, William Lee Irwin III wrote:
> > On Sun, Mar 28, 2004 at 07:54:36PM +0200, Jens Axboe wrote:
> > > Sorry, but I cannot disagree more. You think an artificial limit at the
> > > block layer is better than one imposed at the driver end, which actually
> > > has a lot more of an understanding of what hardware it is driving? This
> > > makes zero sense to me. Take floppy.c for instance, I really don't want
> > > 1MB requests there, since that would take a minute to complete. And I
> > > might not want 1MB requests on my Super-ZXY storage, because that beast
> > > completes io easily at an iorate of 200MB/sec.
> > > So you want to put this _policy_ in the block layer, instead of in the
> > > driver. That's an even worse decision if your reasoning is policy. The
> > > only such limits I would want to put in, are those of the bio where
> > > simply is best to keep that small and contained within a single page to
> > > avoid higher order allocations to do io. Limits based on general sound
> > > principles, not something that caters to some particular piece of
> > > hardware. I absolutely refuse to put a global block layer 'optimal io
> > > size' restriction in, since that is the ugliest of policies and without
> > > having _any_ knowledge of what the hardware can do.
> >
> > How about per-device policies and driver hints wrt. optimal io?
>
> Yep, user-tunable per-device policies with sane driver defaults.

BTW, these are trivial to expose through sysfs as they're inside the
queue already.

Making something user tunable is usually not the best idea, if you can
deduct these things automagically instead. So whether this is the best
idea, depends on which way you want to go.

--
Jens Axboe

Subject: Re: [PATCH] speed up SATA

On Sunday 28 of March 2004 20:30, Jens Axboe wrote:
> On Sun, Mar 28 2004, Bartlomiej Zolnierkiewicz wrote:
> > On Sunday 28 of March 2004 20:12, William Lee Irwin III wrote:
> > > On Sun, Mar 28, 2004 at 07:54:36PM +0200, Jens Axboe wrote:
> > > > Sorry, but I cannot disagree more. You think an artificial limit at
> > > > the block layer is better than one imposed at the driver end, which
> > > > actually has a lot more of an understanding of what hardware it is
> > > > driving? This makes zero sense to me. Take floppy.c for instance, I
> > > > really don't want 1MB requests there, since that would take a minute
> > > > to complete. And I might not want 1MB requests on my Super-ZXY
> > > > storage, because that beast completes io easily at an iorate of
> > > > 200MB/sec.
> > > > So you want to put this _policy_ in the block layer, instead of in
> > > > the driver. That's an even worse decision if your reasoning is
> > > > policy. The only such limits I would want to put in, are those of the
> > > > bio where simply is best to keep that small and contained within a
> > > > single page to avoid higher order allocations to do io. Limits based
> > > > on general sound principles, not something that caters to some
> > > > particular piece of hardware. I absolutely refuse to put a global
> > > > block layer 'optimal io size' restriction in, since that is the
> > > > ugliest of policies and without having _any_ knowledge of what the
> > > > hardware can do.
> > >
> > > How about per-device policies and driver hints wrt. optimal io?
> >
> > Yep, user-tunable per-device policies with sane driver defaults.
>
> BTW, these are trivial to expose through sysfs as they're inside the
> queue already.

Yep, yep.

> Making something user tunable is usually not the best idea, if you can
> deduct these things automagically instead. So whether this is the best
> idea, depends on which way you want to go.

I think it's the best idea for now, long-term we are better with automagic.

Bartlomiej

2004-03-28 18:56:21

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> On Sun, Mar 28 2004, Jamie Lokier wrote:
>
>>Jens Axboe wrote:
>>
>>>Sorry, but I cannot disagree more. You think an artificial limit at
>>>the block layer is better than one imposed at the driver end, which
>>>actually has a lot more of an understanding of what hardware it is
>>>driving? This makes zero sense to me. Take floppy.c for instance, I
>>>really don't want 1MB requests there, since that would take a minute
>>>to complete. And I might not want 1MB requests on my Super-ZXY
>>>storage, because that beast completes io easily at an iorate of
>>>200MB/sec.
>>
>>The driver doesn't know how fast the drive is either.
>>
>>Without timing the drive, interface, and for different request sizes,
>>neither the block layer _nor_ the driver know a suitable size.

Nod, this is pretty much my objection to hardcoding an artificial limit
in the driver...


> The driver may not know exactly, but it does know a ball park figure.
> You know if you are driving floppy (sucky transfer and latency), hard
> drive, cdrom (decent transfer, sucky seeks), etc.

Agreed. Really we have two types of information:
* the device's hard limit
* the default limit that should be applied to that class of devices

I would much rather do something like

blk_queue_set_class(q, CLASS_DISK)

and have that default per-class policy

switch (class) {
case CLASS_DISK:
q->max_sectors = min(q->max_sectors, CLASS_DISK_MAX_SECTORS);
...

than hardcode the limit in the driver. That's easy and quick. That's a
minimal solution that gives me what I want -- don't hardcode generic
limits in the driver -- while IMO giving you what you want, a sane limit
in an easy way.

Right now we are hardcoding the same per-class limits into each floppy
driver, each disk driver, etc. At the very least devices that act the
same way should all be using the same tunable, whether it's a
compile-time tunable (CLASS_xxx_MAX_SECTORS) or a runtime tunable.
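
For what it's worth, here is a tiny standalone model of that idea (plain C,
not kernel code; the class names and sector values below are invented purely
for illustration):

#include <stdio.h>

enum blk_class { CLASS_DISK, CLASS_FLOPPY, CLASS_CDROM };

#define CLASS_DISK_MAX_SECTORS   2048  /* 1MB with 512-byte sectors */
#define CLASS_FLOPPY_MAX_SECTORS 64    /* 32KB */
#define CLASS_CDROM_MAX_SECTORS  512   /* 256KB */

struct queue_limits {
        unsigned int max_hw_sectors;   /* what the hardware can do */
        unsigned int max_sectors;      /* what we actually issue */
};

static unsigned int min_uint(unsigned int a, unsigned int b)
{
        return a < b ? a : b;
}

/* driver hands over only its class; the generic layer applies the cap */
static void blk_queue_set_class(struct queue_limits *q, enum blk_class class)
{
        switch (class) {
        case CLASS_DISK:
                q->max_sectors = min_uint(q->max_hw_sectors,
                                          CLASS_DISK_MAX_SECTORS);
                break;
        case CLASS_FLOPPY:
                q->max_sectors = min_uint(q->max_hw_sectors,
                                          CLASS_FLOPPY_MAX_SECTORS);
                break;
        case CLASS_CDROM:
                q->max_sectors = min_uint(q->max_hw_sectors,
                                          CLASS_CDROM_MAX_SECTORS);
                break;
        }
}

int main(void)
{
        struct queue_limits q = { 65534, 65534 };  /* lba48-capable disk */

        blk_queue_set_class(&q, CLASS_DISK);
        printf("issuing at most %u sectors per request\n", q.max_sectors);
        return 0;
}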

Long term, the IO scheduler and the VM should really be figuring out the
best request size, from zero to <hardware limit>.

Jeff



2004-03-28 19:00:14

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Bartlomiej Zolnierkiewicz wrote:
> On Sunday 28 of March 2004 20:30, Jens Axboe wrote:
>>Making something user tunable is usually not the best idea, if you can
>>deduct these things automagically instead. So whether this is the best
>>idea, depends on which way you want to go.
>
>
> I think it's the best idea for now, long-term we are better with automagic.


Mostly agreed:

Like I mentioned in the last message, the IO scheduler and the VM should
really just figure out request size and queue depth and such based on
observation of device throughput and latency. So I agree w/ automagic.

But the sysadmin should also be allowed to say "I don't care about
latency" if he has gobs and gobs of memory and knows his configuration well.

I like generic tunables such as "laptop mode" or "low latency" or "high
throughput". These sorts of tunables should affect the "automagic"
calculations.

Jeff



2004-03-28 19:06:50

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> Yep it sure does, but try and find a drive attached to a SATA controller
> that cannot do 40MiB/sec (or something like that). Storage doesn't move
> _that_ fast, you can keep up.

Nanosecond latencies and disturbingly high throughput are already
possible today :) Consider the battery-backed RAM gadgets that present
themselves as ATA devices, or nbd over a 10gige network.

In fact I'm about to strip down drivers/scsi/sata_promise.c to a driver
that just talks to the DIMM, and nothing else: drivers/block/pdc_mem.c.
At that point you're really just looking at the PCI bus and RAM limits...

Jeff



2004-03-28 19:52:43

by Nuno Silva

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA


Bartlomiej Zolnierkiewicz wrote:
| On Sunday 28 of March 2004 20:12, William Lee Irwin III wrote:
|
|>On Sun, Mar 28, 2004 at 07:54:36PM +0200, Jens Axboe wrote:
|>

[...]

|>>hardware. I absolutely refuse to put a global block layer 'optimal io
|>>size' restriction in, since that is the ugliest of policies and without
|>>having _any_ knowledge of what the hardware can do.
|>
|>How about per-device policies and driver hints wrt. optimal io?
|
|
| Yep, user-tunable per-device policies with sane driver defaults.
|

I think that automagic configuration for the common workload with some
way (sysfs|proc) to retrieve and set policies is the way to go.

With this kind of control we could have /etc/init.d/io-optimize that
pauses the startup for 10 seconds, tests every device|controller in
fstab, and optimizes according to the .conf file for latency or speed...
Or a daemon that retrieves statistics and adjusts the policies every minute?

Also, everybody says "do it in userland". This is doing (some of) it in
userland :)

Regards,
Nuno Silva



2004-03-28 20:02:50

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Nuno Silva wrote:
> With this kind of control we could have /etc/init.d/io-optimize that
> paused the startup for 10 seconds and tests every device|controller in
> fstab and optimizes according to the .conf file for latency or speed...
> Or a daemon that retrieves statistics and adjusts the policies every
> minute?
>
> Also, everybody says "do it in userland". This is doing (some of) it in
> userland :)


An iotune daemon that sits in userland -- probably mlockall'd into
memory -- would be fun. Like the existing irqbalance daemon that Red
Hat (disc: my employer) ships, "iotuned" could run in the background,
adjusting policies every so often. It would be a good place to put some
of the more high-level "laptop mode" or "high throughput mode" tunables,
so that the user wouldn't have to worry about the minute details of such
things.

Just watch the IO statistics and tweak the queue settings... Gotta make
sure the queue settings don't get tweaked beyond what the hardware can
do, though.

A good iotune daemon would probably have to pay attention to various VM
statistics as well, since the VM is often the one that decides when to
start writing out stuff... such an app could (quite justifiably)
mushroom into a io-and-vmtune daemon.
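
To make that a bit more concrete, a very rough userland skeleton of such a
daemon could look like the code below. It is only a sketch of the control
loop: the sysfs path is an assumption (the exact knob name varies), and the
latency measurement is left as a stub that a real daemon would fill in from
I/O statistics deltas.

#include <stdio.h>
#include <unistd.h>

#define LAT_TARGET_MS 50.0      /* example latency budget */
/* assumed location of a writable per-queue limit -- adjust to taste */
#define LIMIT_PATH "/sys/block/sda/queue/max_sectors_kb"

/* stub: a real daemon would compute this from I/O statistics deltas */
static double measured_latency_ms(void)
{
        return 10.0;
}

static void write_limit(unsigned int kb)
{
        FILE *f = fopen(LIMIT_PATH, "w");

        if (!f)
                return;         /* knob not present: leave kernel defaults alone */
        fprintf(f, "%u\n", kb);
        fclose(f);
}

int main(void)
{
        unsigned int kb = 512;  /* start conservative */

        for (;;) {
                double lat = measured_latency_ms();

                if (lat > LAT_TARGET_MS && kb > 128)
                        kb /= 2;        /* latency too high: back off */
                else if (lat < LAT_TARGET_MS / 2 && kb < 8192)
                        kb *= 2;        /* plenty of headroom: grow */

                write_limit(kb);
                sleep(60);      /* "adjusting policies every so often" */
        }
}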

This is a side issue from the topic of finding a decent in-kernel
default, of course...

Jeff



2004-03-28 20:12:36

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> On Sun, Mar 28 2004, Jeff Garzik wrote:
>>I bet if we can come up with a decent proposal, with technical rationale
>>for the change... that could be presented to the right ATA people :)
>>It's worth a shot.

> Count me in for that. The ATA people seem to have a preference for

Cool.


> adding a new set of commands for this type of there, where as I
> (originally, I did actually send in a proposal like this on the ml)
> wanted to just use one of the reserved bits in the tag field.

Well, I would like to present the problem, in the proposal, in a
completely separate area of the document than the suggested solution.

Over and above the barrier issue, the general problem of "OS doesn't
know precisely what's on the platter" leads to a nasty edge case:

If an error occurs where the typical resolution is a bus or device
reset, cached writes that have been acknowledged to the OS but not yet
hit the media will likely be lost. This seems to be worse in SATA,
where you have a new class of errors (SATA link up/down, etc.) that is
also typically dealt with via reset.

Jeff



2004-03-28 20:21:58

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> So you could change ->max_sectors to be 'max allowable io, hardware
> limitation' and add ->optimal_sectors to be 'best sized io'. I don't see
> that it buys you anything, since you'd be going for optimal_sectors all
> the time anyways.


In terms of implementation, I imagine we would add a ->max_hw_sectors,
once the IO scheduler and VM (or userland?) are smart enough to tune
->max_sectors dynamically.

Jeff



2004-03-28 20:27:49

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Stefan Smietanowski wrote:
> Hi Jeff.
>
>> I'm about to add a raft of SATA-2 hardware, all of which are queued.
>> The standard depth is 32, but one board supports a whopping depth of 256.
>
>
> Speaking of which .. I just read an announcement that someone (of course
> the name eludes me) announced a DVD Burner that's SATA.
>
> Found it:
>
> http://www.plextor.com/english/news/press/712SA_pr.htm
>
> a) Are there provisions in the SATA (1) SPEC for support of
> non-disk units?
>
> b) if (strcmp(a, "no"))
> Do you know anything about it, ie is it SATA1 or 2 or what?
>
> c) Let's ponder one gets a unit like this - is it usable with
> libata yet?
>
> d) if (strcmp(c, "no"))
> Will it? :)


SATA ATAPI looks and works just like PATA ATAPI, with one notable
exception: S/ATAPI will include "asynchronous notification", a feature
that allows you to eliminate the polling of the cdrom driver that
normally occurs.

You can use ATAPI on SATA today, using a PATA->SATA bridge. In fact
that's the only way I can test SATA ATAPI at all, right now.

I hope somebody sends me one of these Plextor devices for testing ;-)

Jeff



2004-03-28 20:32:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jeff Garzik <[email protected]> wrote:
>
> I like generic tunables such as "laptop mode" or "low latency" or "high
> throughput". These sorts of tunables should affect the "automagic"
> calculations.

Not sure. Things like "low latency" and "high throughput" may need other
things, such as "seek latency" and "bandwidth" as _inputs_, not as outputs.

Such device parameters should have reasonable defaults, and use a userspace
app which runs a quick seek latency and bandwidth test at mount-time,
poking the results into sysfs.

2004-03-28 20:35:23

by Eric D. Mudama

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 at 16:56, Nick Piggin wrote:
>Eric D. Mudama wrote:
>>32-MB requests are the best for raw throughput.
>>
>>~15ms to land at your target location, then pure 50-60MB/sec for the .5
>>seconds it takes to finish the operation. (media limited at that point)
>>
>>Sure, there's more latency, but I guess that is application dependant.
>>
>
>What about a queue depth of 2, and writing that 32MB in 1MB requests?

I'm 99% sure that it'll wind up slower... The concept of "ZLR" (zero
latency read), where the drive starts reading immediately wherever it
lands, then merges that into the host request later, possibly catching
the front of the command at the very end, can help. However, if you
issue those 32 tags immediately, there's the chance that the drive
won't recognize the pattern as quickly, and will wind up doing additional
seeks, or even worse, additional, unnecessary dwell... With queueing,
you could wind up caching data on the back end of 32 host requests
without the first LBA in cache for any of them, which means you can't
start those data transfers until you've read a "first LBA", meaning
you now need to discard something you've intentionally read from the
media.

The kicker when handling any request larger than the on-disk cache
size combined with queueing is that you don't know ahead of time if
you'll be able to satisfy the request by the time you get there. If
the drive was already holding megabytes 4-5 in cache, that works
pretty well as it'll be an instant cache hit, however, if you were
holding megabytes 4-4.99, there's a chance you will have had to
discard some of that data to make room for other in-process disk work
(say you had a buffer almost full of dirty writes and therefore
insufficient buffer for the entire host request), which means you may
need to toss megabytes 4.5-4.99 out, in which case you'll need to do
the disk I/O anyway. In that case, starting at the beginning of the
entire region and just tossing from cache as you transfer to the host
works a lot better.

On a large sequential access, the odds of a cache hit into that
sequential area is minimal in practice, so we can actually just churn
the same small chunk of buffer satisfying the large read, while not
losing any efficiency due to having 6.5MB of dirty random writes ready
to go on the back end.

A 32-MB op by definition cannot be atomic with an 8MB DRAM on the
drive, so once we give up that idea, we optimize our buffer usage for
maximum cache efficiency and granularity. (Basically, if the drive
reports a hard error writing a 32MB request, you need to assume the
entire 32MB request needs to be reissued. You can't assume that just
because the error was on the last block, that you can "trust" that the
first LBA of the request actually made it.)

--eric


--
Eric D. Mudama
[email protected]

2004-03-28 20:48:40

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Andrew Morton wrote:
> Jeff Garzik <[email protected]> wrote:
>
>> I like generic tunables such as "laptop mode" or "low latency" or "high
>> throughput". These sorts of tunables should affect the "automagic"
>> calculations.
>
>
> Not sure. Things like "low latency" and "high throughput" may need other
> things, such as "seek latency" and "bandwidth" as _inputs_, not as outputs.

I should probably better define the hypotheticals :) I think of "laptop
mode" or "low latency versus high throughput" more as high-level binary
flags, influencing widely varying things, from an ATA disk's "low
noise versus high performance" tunable to the IO scheduler's deadlines.


> Such device parameters should have reasonable defaults, and use a userspace
> app which runs a quick seek latency and bandwidth test at mount-time,
> poking the results into sysfs.

Certainly...

Jeff



2004-03-28 20:52:09

by Eric D. Mudama

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 at 12:54, Jeff Garzik wrote:
>Jamie Lokier wrote:
>>This is what I mean: turn off write cacheing, and performance on PATA
>>drops because of the serialisation and lost inter-command time.
>>
>>With TCQ-on-write, you can turn off write cacheing and in theory
>>performance doesn't have to drop, is that right?
>
>hmmm, that's a good point. I honestly don't know. Something to be
>tested, I suppose...
>
>My premature guess would be that you would need a queue depth larger
>than 2 or 4 before performance starts to not-suck.

Write-cache off + queueing means that the drive can't "finish" the
command until it is on the media. This means that until the block is
written, you won't get a 0x50 status or a SetBits FIS.

In theory, write-cache-off queueing will be slightly slower than
random reads+queueing, since most drives have slightly stricter
guidelines for write-settle versus read-settle. (On a read settle you can
let your ECC hardware tell you that you pooched a seek and went
off-track because you were still vibrating; for writes, that is liable
to corrupt good data on adjacent tracks which you may not have in the
buffer and therefore might not be able to repair.)

However, cached writes (queued or unqueued), especially small ones,
will have WAY higher ops/sec. ATA TCQ is limited to 32 choices for
the next best operation, but an 8MB buffer doing 4K random-ops could
potentially have ~2000 choices for the next thing to do. (assuming
perfect cache granularity, etc, etc)

At 32 choices, the seek and rotate are still somewhat significant,
though way better than unqueued. With 2000 things to choose from, the
drive is never, ever idle. Seek, land just-in-time, read/write,
rinse/repeat.

--eric


--
Eric D. Mudama
[email protected]

2004-03-28 20:54:17

by Eric D. Mudama

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 at 15:12, Jeff Garzik wrote:
>Over and above the barrier issue, the general problem of "OS doesn't
>know precisely what's on the platter" leads to a nasty edge case:
>
>If an error occurs where the typical resolution is a bus or device
>reset, cached writes that have been acknowledged to the OS but not yet
>hit the media will likely be lost. This seems to be worse in SATA,
>where you have a new class of errors (SATA link up/down, etc.) that is
>also typically dealt with via reset.
>
> Jeff

This shouldn't be the case.

We'll always complete writes we've given good status for prior to
saying "ok" on an incoming reset. Hard or soft, all resets wait on a
clean internal cache before we process the reset itself.

--eric


--
Eric D. Mudama
[email protected]

2004-03-28 20:58:46

by Eric D. Mudama

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA


Er, forgot about the queue depth of only 2...

Even in that case, you'll more than likely still get better throughput
with a single 32-MB command... If you send a pair of queued commands
down, and the 2nd one is chosen, there's no reason that the first one
won't get starved until the very end of the request, which would have
bad latency on that command.

Even worse would be it getting force-promoted in the middle of the
rest of the 32MB chunk due to an approaching internal timeout, and
therefore interrupting a 31-MB sequential read to do a 1MB read.

However, most of this is silly... 32 1MB requests will be way faster
than 32,000 1LBA requests, etc... in general, bigger requests = better
throughput for all but a few specific workloads that I can think of
off the top of my head. (lots of simultaneous stream workloads with
very low latency requirements per-stream, for example)

--
Eric D. Mudama
[email protected]

2004-03-28 21:17:12

by Stefan Smietanowski

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Hi Jeff.

> SATA ATAPI looks and works just like PATA ATAPI, with one notable
> exception: S/ATAPI will include "asynchronous notification", a feature
> that allows you to eliminate the polling of the cdrom driver that
> normally occurs.
>
> You can use ATAPI on SATA today, using a PATA->SATA bridge. In fact
> that's the only way I can test SATA ATAPI at all, right now.
>
> I hope somebody sends me one of these Plextor devices for testing ;-)

Would that mean that one uses the same (sub)drivers as normal SCSI
devices do ?

sd/sg/etc ...

// Stefan

2004-03-28 21:26:25

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Stefan Smietanowski wrote:
> Hi Jeff.
>
>> SATA ATAPI looks and works just like PATA ATAPI, with one notable
>> exception: S/ATAPI will include "asynchronous notification", a
>> feature that allows you to eliminate the polling of the cdrom driver
>> that normally occurs.
>>
>> You can use ATAPI on SATA today, using a PATA->SATA bridge. In fact
>> that's the only way I can test SATA ATAPI at all, right now.
>>
>> I hope somebody sends me one of these Plextor devices for testing ;-)
>
>
> Would that mean that one uses the same (sub)drivers as normal SCSI
> devices do ?
>
> sd/sg/etc ...

Correct.

Jeff




2004-03-29 00:55:17

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28, 2004 at 01:59:51PM -0500, Jeff Garzik wrote:
> Bartlomiej Zolnierkiewicz wrote:
> >On Sunday 28 of March 2004 20:30, Jens Axboe wrote:
> >>Making something user tunable is usually not the best idea, if you can
> >>deduct these things automagically instead. So whether this is the best
> >>idea, depends on which way you want to go.
> >
> >
> >I think it's the best idea for now, long-term we are better with automagic.
>
>
> Mostly agreed:
>
> Like I mentioned in the last message, the IO scheduler and the VM should

this is not an I/O scheduler or VM issue.

the max size of a request is something that should be set internally to
the blkdev layer (at a lower level than the I/O scheduler or the VM
layer).

The point is that if you read contiguously from disk with a 1M or 32M
request size, the wall time speed difference will be maybe 0.01% or so.
Running 100 irqs per second or 3 irqs per second doesn't make any
measurable difference. Same goes for keeping the I/O pipeline full, 1M
is more than enough to go at the speed of the storage with minimal cpu
overhead. We waste 900 irqs per second just in the timer irq and
another 900 irqs per second per-cpu in the per-cpu local interrupts in
smp.

In 2.4 reaching 512k DMA units that helped a lot, but going past 512k
didn't help in my measurements. 1M maybe these days is needed (as Jens
suggested) but >1M still sounds overkill and I completely agree with
Jens about that.

If one day things will change and the harddisk will require 32M large
DMA transactions to keep up with the speed of the disk, the thing should
be still solved during disk discovery inside the blkdev layer. The
"automagic" suggestions discussed by Jamie and Jens should be just
benchmarks internal to the blkdev layer, trying to read contiguously
first with 1M then 2M then 4M etc.. until the speed difference goes
below 1% or whatever similar "autotune" algorithm.

But definitely this is not an I/O scheduler or VM issue, it's all about
discovering the minimal DMA transaction size that provides peak bulk I/O
performance for a certain device. The smaller the size, the better the
latencies and the less ram will be pinned at the same time (i.e. think of a
64M machine writing in 32M chunks at a time).

Of course if we'll ever deal with hardware where 32M requests make a
difference, then we may have to add overrides to the I/O scheduler to
lower the max_requests (i.e. like my obsolete max_bomb_segments did).
But I expect that by default the contiguous I/O will use the max_sectors
chosen by the blkdev layer (not chosen by the VM or I/O scheduler) to
guarantee the best bulk I/O performance as usual (the I/O scheduler
option would be just an optional override). The max_sectors is just
about using a sane DMA transaction size, good enough to run at
disk-speed without measurable cpu overhead, but without being too big so
that it provides sane latencies. Overkill huge DMA transactions might
even stall the cpu when accessing the mem bus (though I'm not a
hardware guru so this is just a guess).

So far there was no need to autotune it, and settings like 512k were
optimal.

Don't take me wrong, I find it extremely great that you can now raise the
IDE request size to a value like 512k; the 128k limit was the ugliest
thing about IDE ever. But you provided zero evidence that going past 512k
is beneficial at all, and your bootup log showing 32M is anything but
exciting, I'd be a lot more excited to see 512k there.

I expect that the boost from 128k to 512k is very significant, but I
expect that from 512k to 32M there will just be a total waste of latency
with zero performance gain in throughput. So unless you measure any
speed difference from 512k to 32M, I recommend setting it to 512k for the
short term, like most other drivers do, for the same reasons.

2004-03-29 01:30:16

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Eric D. Mudama wrote:

>
> Er, forgot about the queue depth of only 2...
>
> Even in that case, you'll more than likely still get better throughput
> with a single 32-MB command... If you send a pair of queued commands
> down, and the 2nd one is chosen, there's no reason that the first one
> won't get starved until the very end of the request, which would have
> bad latency on that command.
>

Well strictly, you send them one after the other. So unless you
have something similar to our anticipatory scheduler or plugging
mechanism, the drive should attack the first one first, shouldn't
it?


2004-03-29 04:04:16

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Andrea Arcangeli wrote:
> On Sun, Mar 28, 2004 at 01:59:51PM -0500, Jeff Garzik wrote:
>
>>Bartlomiej Zolnierkiewicz wrote:
>>
>>>On Sunday 28 of March 2004 20:30, Jens Axboe wrote:
>>>
>>>>Making something user tunable is usually not the best idea, if you can
>>>>deduct these things automagically instead. So whether this is the best
>>>>idea, depends on which way you want to go.
>>>
>>>
>>>I think it's the best idea for now, long-term we are better with automagic.
>>
>>
>>Mostly agreed:
>>
>>Like I mentioned in the last message, the IO scheduler and the VM should
>
>
> this is not an I/O scheduler or VM issue.

This involves the interaction of three: blkdev layer, IO scheduler, and VM.

VM: initiates most of the writeback, and is often the main initiator of
large requests. The VM thresholds also serve to keep request size
manageable. See e.g.
http://marc.theaimsgroup.com/?l=linux-kernel&m=108043321326801&w=2

IO scheduler: the place to make the decision about whether the request
latency is meeting expectations, etc. It should be straightforward to
use a windowing algorithm to slowly increase the request size until (a)
latency limits are reached, (b) hardware limits are reached, or (c) VM
thresholds are reached.
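
As a rough model of that windowing idea (plain C, not an actual elevator
hook; the constants and names here are invented), the update step could
look like:

/* grow or shrink the per-request window, capped by (a) a latency target,
 * (b) the hardware limit and (c) a VM-imposed ceiling */
static unsigned int next_window(unsigned int cur_sectors,
                                unsigned int hw_max_sectors,    /* (b) */
                                unsigned int vm_max_sectors,    /* (c) */
                                unsigned int last_exec_ms,
                                unsigned int lat_target_ms)     /* (a) */
{
        unsigned int next = cur_sectors;

        if (last_exec_ms > lat_target_ms)
                next = cur_sectors / 2;                 /* too slow: shrink */
        else if (last_exec_ms < lat_target_ms / 2)
                next = cur_sectors + cur_sectors / 4;   /* headroom: grow */

        if (next > hw_max_sectors)
                next = hw_max_sectors;
        if (next > vm_max_sectors)
                next = vm_max_sectors;
        if (next < 8)
                next = 8;                               /* never below 4KB */

        return next;
}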

Ultimately there must be some -global- management of I/O, otherwise VM
cannot survive, e.g. 128k requests on 1000 disks :)


> the max size of a request is something that should be set internally to
> the blkdev layer (at a lower level than the I/O scheduler or the VM
> layer).

Yes, I agree.

My point is there are two maximums:

1) the hardware limit
2) the limit that "makes sense", e.g. 512k or 1M for most

The driver should only care about #1, and should be "told" #2.

A very, very, very minimal implementation could be this:

--- 1.138/include/linux/blkdev.h Fri Mar 12 04:33:07 2004
+++ edited/include/linux/blkdev.h Sun Mar 28 22:44:15 2004
@@ -607,6 +607,24 @@

extern void drive_stat_acct(struct request *, int, int);

+#define BLK_DISK_MAX_SECTORS 2048
+#define BLK_FLOPPY_MAX_SECTORS 64


Hardcoding such a maximum in the driver is inflexible and IMO incorrect.


> If one day things will change and the harddisk will require 32M large
> DMA transactions to keep up with the speed of the disk, the thing should
> be still solved during disk discovery inside the blkdev layer. The

32M is probably too large, but 1M is probably too small for:
 * a RAID array with 33 disks that presents itself as a single SATA disk
 * solid-state storage: battery-backed RAM

These things like bigger requests, and were designed to solve a lot of
the latency problems in hardware.


> "automagic" suggestions discussed by Jamie and Jens should be just
> benchmarks internal to the blkdev layer, trying to read contiguously
> first with 1M then 2M then 4M etc.. until the speed difference goes
> below 1% or whatever similar "autotune" algorithm.

Yes, agreed.

My main goal is to -not- worry about this in the low-level driver. If
you and Jens think 1M requests are maximum for disks, then put that in
the _blkdev_ layer not my driver :)

Long term, I would like to see something like

--- 1.138/include/linux/blkdev.h Fri Mar 12 04:33:07 2004
+++ edited/include/linux/blkdev.h Sun Mar 28 23:01:42 2004
@@ -337,7 +337,8 @@
*/
unsigned long nr_requests; /* Max # of requests */

- unsigned short max_sectors;
+ unsigned short max_sectors; /* blk layer-chosen */
+ unsigned short max_hw_sectors; /* hardware limit */
unsigned short max_phys_segments;
unsigned short max_hw_segments;


2004-03-29 04:30:24

by Wim Coekaerts

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

> In 2.4 reaching 512k DMA units that helped a lot, but going past 512k
> didn't help in my measurements. 1M maybe these days is needed (as Jens
> suggested) but >1M still sounds overkill and I completely agree with
> Jens about that.

at least 1Mb... more than 1mb (I doubt 32mb is really necessarily
useful) is nice for flushing contiguous journal data to disk. between
1-8mb in one io.

at least 1mb is good when you have to process massive amounts of data,
just read huge chunks of files, we tend to do 1mb. anyway,
as you said, at some point it's a bit overkill. I think 1-8mb makes
sense.

2004-03-29 04:32:40

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Mon, Mar 29, 2004 at 02:55:02AM +0200, Andrea Arcangeli wrote:
> The point is that if you read contiguously from disk with a 1M or 32M
> request size, the wall time speed difference will be maybe 0.01% or so.
> Running 100 irqs per second or 3 irqs per second doesn't make any
> measurable difference. Same goes for keeping the I/O pipeline full, 1M
> is more than enough to go at the speed of the storage with minimal cpu
> overhead. We waste 900 irqs per second just in the timer irq and
> another 900 irqs per second per-cpu in the per-cpu local interrupts in
> smp.
> In 2.4 reaching 512k DMA units that helped a lot, but going past 512k
> didn't help in my measurements. 1M maybe these days is needed (as Jens
> suggested) but >1M still sounds overkill and I completely agree with
> Jens about that.

Maybe we should try this the other way around. Assume those who want
larger request sizes are rational (unlikely as that may be) for the
sake of argument; what could they possibly be after? I'm thinking that
once we know what they're really after we might be able to satisfy the
needs some other way that makes more sense for the general case.

I have a wild guess it may reduce interrupt load and/or overhead of
request submission to the HBA by the factor of the request size
increase as the number of spindles (and of course the RAM with which
to do buffering) grows without bound, but otherwise no real notion of
why on earth anyone could want it. Maybe we should find out who (if
anyone) claims they really want the unreasonably large requests first
so they can be asked directly.


-- wli

2004-03-29 04:58:15

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Andrea Arcangeli wrote:
> is beneficial at all, and your bootup log showing 32M is anything but
> exciting, I'd be a lot more excited to see 512k there.


Just to clarify... 32MB would never ever be reached. The S/G table
limit means requests are limited to 8MB. VM thresholds and user
application use further limit request size.

I think Andrew's point is actually more relevant than examining the size
of a single request:

> the effect of really big requests will be the same
> as the effect of permitting _more_ requests.

Thus like the "1,000 disks" example, memory management needs to make
sure that an "unreasonable" amount of memory is not being pinned.

Jeff



2004-03-29 05:23:30

by Eric D. Mudama

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Mon, Mar 29 at 11:30, Nick Piggin wrote:
>Well strictly, you send them one after the other. So unless you
>have something similar to our anticipatory scheduler or plugging
>mechanism, the drive should attack the first one first, shouldn't
>it?

If you send 32 commands to our disk at once (TCQ/NCQ) we send 'em all
to our back-end disk engine as fast as possible.

--eric


--
Eric D. Mudama
[email protected]

2004-03-29 07:36:18

by Denis Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Monday 29 March 2004 06:29, Wim Coekaerts wrote:
> > In 2.4 reaching 512k DMA units that helped a lot, but going past 512k
> > didn't help in my measurements. 1M maybe these days is needed (as Jens
> > suggested) but >1M still sounds overkill and I completely agree with
> > Jens about that.
>
> at least 1Mb... more than 1mb (I doubt 32mb is really necessarily
> useful) is nice for flushing contiguous journal data to disk. between
> 1-8mb in one io.
>
> at least 1mb is good when you have to process massive amounts of data,
> just read huge chunks of files, we tend to do 1mb. anyway,
> as you said, at some point it's a bit overkill. I think 1-8mb makes
> sense.

It makes sense *today*. Do you want to return to this discussion
ad infinitum?

Andrea is right, block layer can be coded to detect such a (per-device)
request size increasing which does not give you any more throughput,
or increasing which hurts your latency too much, and stick with it.

More detailed:

1. size = 1 sector
2. measure throughput and latency
3. size *= 2
4. measure new_throughput and new_latency
5. if(new_throughput<=throughput*1.01 || new_latency>50ms) break
6. throughput = new_throughput; latency = new_latency; goto 3

Real code will have admin overrides (start from N sectors,
never go higher than M sectors etc), but you got the idea.
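
A compilable toy version of that loop, with the timed reads stubbed out
(the stub numbers and names are invented; a real probe would of course do
actual timed contiguous reads), might read:

#include <stdio.h>

/* stand-ins for a real timed contiguous-read test at the given size */
static double measure_mb_s(unsigned int sectors)
{
        return 10.0 + sectors;          /* fake: keeps improving */
}

static double measure_latency_ms(unsigned int sectors)
{
        return sectors / 64.0;          /* fake: grows linearly */
}

static unsigned int autotune_max_sectors(unsigned int hw_limit)
{
        unsigned int size = 1;                  /* 1. start at one sector */
        double tput = measure_mb_s(size);       /* 2. baseline */

        while (size * 2 <= hw_limit) {
                unsigned int next = size * 2;   /* 3. double the size */
                double new_tput = measure_mb_s(next);           /* 4. */
                double new_lat = measure_latency_ms(next);

                /* 5. stop when the gain drops below 1% or latency
                 *    goes past 50ms */
                if (new_tput <= tput * 1.01 || new_lat > 50.0)
                        break;

                size = next;                    /* 6. keep it, go again */
                tput = new_tput;
        }
        return size;
}

int main(void)
{
        printf("picked %u sectors\n", autotune_max_sectors(65534));
        return 0;
}
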
--
vda

2004-03-29 08:11:08

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Jeff Garzik wrote:
> >The driver may not know exactly, but it does know a ball park figure.
> >You know if you are driving floppy (sucky transfer and latency), hard
> >drive, cdrom (decent transfer, sucky seeks), etc.
>
> Agreed. Really we have two types of information:
> * the device's hard limit
> * the default limit that should be applied to that class of devices
>
> I would much rather do something like
>
> blk_queue_set_class(q, CLASS_DISK)
>
> and have that default per-class policy
>
> switch (class) {
> case CLASS_DISK:
> q->max_sectors = min(q->max_sectors, CLASS_DISK_MAX_SECTORS);
> ...

Oh god no, I was afraid you'd come up with something like this :-). I'm
quite sure this will lead to an unmaintainable mess, as more and various
different classes are introduced. Don't try to model something that you
really cannot model.

> than hardcode the limit in the driver. That's easy and quick. That's a
> minimal solution that gives me what I want -- don't hardcode generic
> limits in the driver -- while IMO giving you what you want, a sane limit
> in an easy way.
>
> Right now we are hardcoding the same per-class limits into each floppy
> driver, each disk driver, etc. At the very least devices that act the
> same way should all be using the same tunable, whether it's a
> compile-time tunable (CLASS_xxx_MAX_SECTORS) or a runtime tunable.
>
> Long term, the IO scheduler and the VM should really be figuring out the
> best request size, from zero to <hardware limit>.

Let's leave it like it is, and try and tweak latencies dynamically. Could
be used to limit tcq depth, not just request sizes solving two problems
at once. I already have a tiny bit of keeping this accounting to do
proper unplugs (right now it just looks at missing requests from the
pool, doesn't work on tcq).

--
Jens Axboe

2004-03-29 08:16:43

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28 2004, Wim Coekaerts wrote:
> > In 2.4 reaching 512k DMA units that helped a lot, but going past 512k
> > didn't help in my measurements. 1M maybe these days is needed (as Jens
> > suggested) but >1M still sounds overkill and I completely agree with
> > Jens about that.
>
> at least 1Mb... more than 1mb (I doubt 32mb is really necessarily
> useful) is nice for flushing contiguous journal data to disk. between
> 1-8mb in one io.

'is nice' means what, in real numbers? :)

> at least 1mb is good when you have to process massive amounts of data,
> just read huge chunks of files, we tend to do 1mb. anyway,
> as you said, at some point it's a bit overkill. I think 1-8mb makes
> sense.

Yeah, that _range_ makes sense. 1MB is a lot more supportable than 8MB
though, so the benefit needs to be something more than a feel good
sensation.

--
Jens Axboe

2004-03-29 12:32:38

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Hi!

> >What about latency?
...
> merged transfer will top out at 8MB. You don't see that unless
> you're on a totally quiet machine with tons of free, contiguous
> pages. So in practice it winds up being much smaller, the more
> loaded the system gets (and pagecache gets fragmented).
>
> Latency definitely changes for the default case, but remember that a
> lot of that is writeback, or streaming writes. Latency-sensitive
> applications already know how to send small or no-wait I/Os, because

Well, waiting a second for a read because a 32MB write is being done
will not be fun...
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-03-29 12:32:37

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Hi!

> Things seem to be looking pretty good, so it's now time to turn on
> lba48-sized transfers. Most SATA disks will be lba48 anyway, even
> the ones smaller than 137GB, for this and other reasons.
>
> With this simple patch, the max request size goes from 128K to
> 32MB... so you can imagine this will definitely help performance.
> Throughput goes up. Interrupts go down. Fun for the whole family.

32MB is one second if everything goes okay... That will be horrible for latency, right?
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2004-03-29 12:45:01

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> Could be used to limit tcq depth, not just request sizes solving two
> problems at once. I already have a tiny bit of keeping this
> accounting to do proper unplugs (right now it just looks at missing
> requests from the pool, doesn't work on tcq).

Does it make sense to allow different numbers of outstanding TCQ-reads
and TCQ-writes?

-- Jamie

2004-03-29 12:46:37

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Mon, Mar 29 2004, Jamie Lokier wrote:
> Jens Axboe wrote:
> > Could be used to limit tcq depth, not just request sizes solving two
> > problems at once. I already have a tiny bit of keeping this
> > accounting to do proper unplugs (right now it just looks at missing
> > requests from the pool, doesn't work on tcq).
>
> Does it make sense to allow different numbers of outstanding TCQ-reads
> and TCQ-writes?

Might not be a silly thing to experiment with, definitely something that
should be tested (added to list...)

--
Jens Axboe

2004-03-29 12:54:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> > Does it make sense to allow different numbers of outstanding TCQ-reads
> > and TCQ-writes?
>
> Might not be a silly thing to experiment with, definitely something that
> should be tested (added to list...)

Then there's another question: when deciding whether you can issue
another TCQ-read, do you compare the number of outstanding TCQ-reads
against the TCQ-read limit, or compare the _total_ number of
outstanding TCQs against the TCQ-read limit? Similarly for writes.

There are several other logical combinations that could be used.

Each condition will have a different effect on mean and peak latencies
under different load patterns. I'm not sure which makes more sense,
or even if multiple conditions should be used together.
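
For illustration, the two simplest of those conditions could be written as
toy predicates like the following (the struct and field names are invented):

struct tcq_state {
        unsigned int reads_out, writes_out;     /* currently outstanding */
        unsigned int read_limit, write_limit;   /* tunable per-type limits */
};

/* variant 1: compare outstanding reads against the read limit only */
static int can_issue_read_per_type(const struct tcq_state *s)
{
        return s->reads_out < s->read_limit;
}

/* variant 2: compare the total outstanding count against the read limit */
static int can_issue_read_total(const struct tcq_state *s)
{
        return s->reads_out + s->writes_out < s->read_limit;
}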

-- Jamie

2004-03-29 13:08:41

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28, 2004 at 08:29:12PM -0800, Wim Coekaerts wrote:
> > In 2.4 reaching 512k DMA units that helped a lot, but going past 512k
> > didn't help in my measurements. 1M maybe these days is needed (as Jens
> > suggested) but >1M still sounds overkill and I completely agree with
> > Jens about that.
>
> at least 1Mb... more than 1mb (I doubt 32mb is really necessarily
> useful) is nice for flushing contiguous journal data to disk. between
> 1-8mb in one io.
>
> at least 1mb is good when you have to process massive amounts of data,
> just read huge chunks of files, we tend to do 1mb. anyway,
> as you said, at some point it's a bit overkill. I think 1-8mb makes
> sense.

1M certainly is reasonable, though for 8M I have my doubts that you can
measure a throughput improvement, can you?

2004-03-29 13:10:45

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Mon, 2004-03-29 at 14:44, Jens Axboe wrote:
> On Mon, Mar 29 2004, Jamie Lokier wrote:
> > Jens Axboe wrote:
> > > Could be used to limit tcq depth, not just request sizes solving two
> > > problems at once. I already have a tiny bit of keeping this
> > > accounting to do proper unplugs (right now it just looks at missing
> > > requests from the pool, doesn't work on tcq).
> >
> > Does it make sense to allow different numbers of outstanding TCQ-reads
> > and TCQ-writes?
>
> Might not be a silly thing to experiment with, definitely something that
> should be tested (added to list...)

I wonder if the max read size could/should be correlated with the
readahead size for such devices... it sounds like a related property at
least.



2004-03-29 13:12:00

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Mon, Mar 29 2004, Arjan van de Ven wrote:
> On Mon, 2004-03-29 at 14:44, Jens Axboe wrote:
> > On Mon, Mar 29 2004, Jamie Lokier wrote:
> > > Jens Axboe wrote:
> > > > Could be used to limit tcq depth, not just request sizes solving two
> > > > problems at once. I already have a tiny bit of keeping this
> > > > accounting to do proper unplugs (right now it just looks at missing
> > > > requests from the pool, doesn't work on tcq).
> > >
> > > Does it make sense to allow different numbers of outstanding TCQ-reads
> > > and TCQ-writes?
> >
> > Might not be a silly thing to experiment with, definitely something that
> > should be tested (added to list...)
>
> I wonder if the max read size could/should be correlated with the
> readahead size for such devices... it sounds like a related property at
> least.

Indeed, it would be best to keep the read-ahead window at least a
multiple of the max read size. So you don't get the nasty effects of
having a 128k read-ahead window, but a device with a 255-sector limit
resulting in a 124KB + 4KB read.
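
A toy helper expressing that constraint (illustrative only, not the real
read-ahead code):

/* keep the read-ahead window a multiple of the per-request limit, so a
 * 128k window on a device that cannot do 128k per request is not split
 * into one big request plus one tiny leftover */
static unsigned long align_ra_sectors(unsigned long ra_sectors,
                                      unsigned long max_sectors)
{
        if (ra_sectors <= max_sectors)
                return ra_sectors;      /* already a single request */

        return (ra_sectors / max_sectors) * max_sectors;
}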

--
Jens Axboe

2004-03-29 13:18:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28, 2004 at 11:02:43PM -0500, Jeff Garzik wrote:
> My point is there are two maximums:
>
> 1) the hardware limit
> 2) the limit that "makes sense", e.g. 512k or 1M for most
>
> The driver should only care about #1, and should be "told" #2.
>
> A very, very, very minimal implementation could be this:
>
> --- 1.138/include/linux/blkdev.h Fri Mar 12 04:33:07 2004
> +++ edited/include/linux/blkdev.h Sun Mar 28 22:44:15 2004
> @@ -607,6 +607,24 @@
>
> extern void drive_stat_acct(struct request *, int, int);
>
> +#define BLK_DISK_MAX_SECTORS 2048
> +#define BLK_FLOPPY_MAX_SECTORS 64
>
>
> Hardcoding such a maximum in the driver is inflexible and IMO incorrect.

you're complaining about the current API and arguing that long term the
selection of the dma size for bulk I/O should change and be more
dynamic, but this doesn't change the fact that your patch is still wrong today.

Once you change the API too, then you can set the hardware limit in your
driver and rely on the highlevel blkdev code to find the optimal runtime
dma size for bulk I/O, but today it's your driver that is enforcing a
runtime bulk dma size, not the maximum limit of the controller, and so
you should code your patch accordingly.

Probably both the VM and the I/O scheduler will require an override if we're
going to set the dma size to 32M anytime in the future (though even that
isn't obvious, I mean a battery-backed ram isn't going to take that long
a latency to read/write 32M). But that VM/IO-scheduler stuff will be
an "override"; the blkdev value you're playing with is a blkdev-only
thing independent from the VM and I/O scheduler, and it's the min dma size
required to reach bulk I/O performance with minimal cpu overhead, and
that's just a hardware issue that you can measure with an OS w/o a VM and
with an OS w/o an I/O scheduler. If it's too big the VM and the I/O
scheduler may want to reduce it for various reasons, but the decision of
this value from a blkdev point of view is independent from the VM and I/O
scheduler as far as I can tell. The VM and I/O scheduler care about this
value only if it's too big, because if it's too big it hurts latency and
memory pressure.

2004-03-29 13:07:08

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Eric D. Mudama wrote:
> On Mon, Mar 29 at 11:30, Nick Piggin wrote:
> >Well strictly, you send them one after the other. So unless you
> >have something similar to our anticipatory scheduler or plugging
> >mechanism, the drive should attack the first one first, shouldn't
> >it?
>
> If you send 32 commands to our disk at once (TCQ/NCQ) we send 'em all
> to our back-end disk engine as fast as possible.

Are they sent _at once_, or are they sent in sequence? If they're
sent in sequence, even if it's a very rapid sequence, then Nick's
point still stands. If you're not attacking the first request which
arrives, the instant the drive code sees it, you're doing something
similar to the anticipatory scheduler.

-- Jamie

2004-03-29 17:19:58

by Craig I. Hagan

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

> > Might not be a silly thing to experiment with, definitely something that
> > should be tested (added to list...)
>
> I wonder if the max read size could/should be correlated with the
> readahead size for such devices... it sounds like a related property at
> least.

I'd like to see that (for areas where this is supported). This would
make folks with fibre storage much happier.

2004-03-29 18:19:20

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

I don't care how the implementation looks. The most basic points are
that driver writers should not have to care about the value of this
1M-max-request-size magic number, nor is it in any way specific to SATA.

Jeff




2004-03-29 18:48:30

by David Lang

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Pavel, when you say that 32MB would be 1 second, is that due to the limits
of the SATA bus or are you basing this on a particular drive?

If you are basing it on a current single SATA drive (7200rpm, 8MB cache,
etc.), consider that if you are writing to a solid-state drive or to an
array that presents itself as a single SATA drive, then the transaction can
happen MUCH faster.

If this is a limit of the SATA interface bitrate, consider that over the
last several years EVERY interface has seen its bitrate climb
significantly, frequently with little (if any) fundamental change to the
drivers necessary to run things.

David Lang

On Mon, 29 Mar 2004, Pavel Machek wrote:

> Date: Mon, 29 Mar 2004 13:36:42 +0200
> From: Pavel Machek <[email protected]>
> To: Jeff Garzik <[email protected]>
> Cc: [email protected], Linux Kernel <[email protected]>,
> Andrew Morton <[email protected]>
> Subject: Re: [PATCH] speed up SATA
>
> Hi!
>
> > Things seem to be looking pretty good, so it's now time to turn on
> > lba48-sized transfers. Most SATA disks will be lba48 anyway, even
> > the ones smaller than 137GB, for this and other reasons.
> >
> > With this simple patch, the max request size goes from 128K to
> > 32MB... so you can imagine this will definitely help performance.
> > Throughput goes up. Interrupts go down. Fun for the whole family.
>
> 32MB is one second if everything goes okay... That will be horrible for latency, right?
> --
> 64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-03-29 19:46:22

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Andrea Arcangeli wrote:
> Once you change the API too, then you can set the hardware limit in your
> driver and rely on the highlevel blkdev code to find the optimal runtime
> dma size for bulk I/O, but today it's your driver that is enforcing a
> runtime bulk dma size, not the maximum limit of the controller, and so
> you should code your patch accordingly.

This magic 512k or 1M limit has nothing to do with SATA. It's a magic
number whose definition is "we don't want PATA or SATA or SCSI disks
doing larger requests than this for latency, VM, and similar reasons."

That definition belongs outside the low-level SATA driver.

This 512k/1M value is sure to change over time, which involves
needlessly plugging the same value into multiple places. And when such
changes occurs, you must be careful not to exceed hardware- and
errata-based limits -- of which there is no distinction in the code.

I think the length of this discussion alone clearly implies that the
low-level driver should not be responsible for selecting this value, if
nothing else ;-)

Jeff



2004-03-29 20:13:51

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

David Lang wrote:
> Pavel, when you say that 32MB would be 1 second, is that due to the limits
> of the SATA bus or are you basing this on a particular drive?
>
> If you are basing it on a current single SATA drive (7200rpm, 8MB cache,
> etc.), consider that if you are writing to a solid-state drive or to an
> array that presents itself as a single SATA drive, then the transaction can
> happen MUCH faster.
>
> If this is a limit of the SATA interface bitrate, consider that over the
> last several years EVERY interface has seen its bitrate climb
> significantly, frequently with little (if any) fundamental change to the
> drivers necessary to run things.

Sata-1 is 1.5Gbps, and Sata-2 is 3.0Gbps. 6.0Gbps is the next step
people seem to be working on...

Jeff




2004-03-30 05:54:57

by Eric D. Mudama

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Mon, Mar 29 at 13:36, Pavel Machek wrote:
>32MB is one second if everything goes okay... That will be horrible for latency, right?

It'll be somewhere between half a second and 1.5 seconds, depending on
how old your hard drive is, all other things being removed from the
equation. The media rate of modern IDE drives is about 60MB/s right
now, and going up slightly soon. The top SCSI drives are around
85MB/s if I remember correctly.

On average, 1s is about right.
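
(Working that out: 32MB at the ~60MB/s media rate quoted above is roughly
0.5s of pure media transfer, and at half that rate it is already over a
second, which is where the half-second to 1.5-second range comes from.)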

It becomes a tradeoff between a big hit now and a lot of little hits
over time. If you don't expect to have any idle time soon, you're
probably better off flushing as big a chunk as possible to free RAM
for the upcoming/continuous workload. If you think you might have
idle time, the lower MB/s of smaller requests will keep you more
granular and improve latency, even though overall work completed per
unit time will be slightly lower.

For a bursty access pattern, you're probably best off with small
requests that fit within the drive's cache, but I don't have any sort
of measurable data on this.

Most of the OS testing that we're subjected to doesn't attempt large
requests.

--eric

--
Eric D. Mudama
[email protected]

2004-03-30 11:11:04

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Mon, Mar 29 2004, Jeff Garzik wrote:
> Andrea Arcangeli wrote:
> >Once you change the API too, then you can set the hardwre limit in your
> >driver and relay on the highlevel blkdev code to find the optimal runtime
> >dma size for bulk I/O, but today it's your driver that is enforcing a
> >runtime bulk dma size, not the maximim limit of the controller, and so
> >you should code your patch accordingly.
>
> This magic 512k or 1M limit has nothing to do with SATA. It's a magic
> number whose definition is "we don't want PATA or SATA or SCSI disks
> doing larger requests than this for latency, VM, and similar reasons."
>
> That definition belongs outside the low-level SATA driver.
>
> This 512k/1M value is sure to change over time, which involves
> needlessly plugging the same value into multiple places. And when such
> changes occurs, you must be careful not to exceed hardware- and
> errata-based limits -- of which there is no distinction in the code.
>
> I think the length of this discussion alone clearly implies that the
> low-level driver should not be responsible for selecting this value, if
> nothing else ;-)

Here's a quickly done patch that attempts to adjust the value based on a
previous range of completed requests. It changes ->max_sectors to be a
hardware limit, adding ->optimal_sectors to be our max issued io target.
It is split on READ and WRITE. The target is to keep request execution
time under BLK_IORATE_TARGET, which is 50ms in this patch. The read-ahead
window is kept within a single request in size.
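
To make the 50ms target concrete: with HZ == 1000, best_io_rate ends up
as bytes per millisecond, so the calculation in the patch below amounts
to something like this (the 60 MB/s figure is only an example):

#include <stdio.h>

int main(void)
{
        unsigned int best_io_rate = 60 * 1000;  /* bytes per jiffy (ms) ~ 60 MB/s */
        unsigned int target = 50;               /* BLK_IORATE_TARGET, in ms */
        unsigned int optimal_sectors = (best_io_rate * target) >> 9;

        printf("optimal request ~ %u sectors (%u KB)\n",
               optimal_sectors, optimal_sectors / 2);
        return 0;
}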

So this is pretty half-assed, but it gets the point across. Things that
should be looked at (meaning - should be done, but I didn't want to
waste time on them now):

- I added the hook to see how many in-drive queued requests you have, so
there's the possibility to add tcq knowledge as well.

- The optimal_sectors calculation is just an average of previous maximum
with current maximum. The scheme has a number of break down points,
for instance writes starving reads will give you a bad read execution
time, further limiting ->optimal_sectors[READ]

- HZ == 1000 is hardcoded

- bdi->ra_pages setting is just best effort, there's no attempt made to
synchronize this with the issuer. Would require too much effort for
probably zero real gain.

===== drivers/block/elevator.c 1.53 vs edited =====
--- 1.53/drivers/block/elevator.c Mon Jan 19 07:38:36 2004
+++ edited/drivers/block/elevator.c Tue Mar 30 12:59:57 2004
@@ -150,6 +150,15 @@
void elv_requeue_request(request_queue_t *q, struct request *rq)
{
/*
+ * it already went through dequeue, we need to decrement the
+ * in_flight count again
+ */
+ if (blk_account_rq(rq)) {
+ WARN_ON(q->in_flight == 0);
+ q->in_flight--;
+ }
+
+ /*
* if iosched has an explicit requeue hook, then use that. otherwise
* just put the request at the front of the queue
*/
@@ -221,6 +230,9 @@
}
}

+ if (rq)
+ rq->issue_time = jiffies;
+
return rq;
}

@@ -229,6 +241,21 @@
elevator_t *e = &q->elevator;

/*
+ * the time frame between a request being removed from the lists
+ * and to it is freed is accounted as io that is in progress at
+ * the driver side. note that we only account requests that the
+ * driver has seen (REQ_STARTED set), to avoid false accounting
+ * for request-request merges
+ */
+ if (blk_account_rq(rq)) {
+ q->in_flight++;
+#if 0
+ q->max_in_flight = max(q->in_flight, q->max_in_flight);
+#endif
+ WARN_ON(q->in_flight > 2 * q->nr_requests);
+ }
+
+ /*
* the main clearing point for q->last_merge is on retrieval of
* request by driver (it calls elv_next_request()), but it _can_
* also happen here if a request is added to the queue but later
@@ -316,6 +343,14 @@
void elv_completed_request(request_queue_t *q, struct request *rq)
{
elevator_t *e = &q->elevator;
+
+ /*
+ * request is released from the driver, io must be done
+ */
+ if (blk_account_rq(rq)) {
+ WARN_ON(q->in_flight == 0);
+ q->in_flight--;
+ }

if (e->elevator_completed_req_fn)
e->elevator_completed_req_fn(q, rq);
===== drivers/block/ll_rw_blk.c 1.237 vs edited =====
--- 1.237/drivers/block/ll_rw_blk.c Tue Mar 16 11:29:58 2004
+++ edited/drivers/block/ll_rw_blk.c Tue Mar 30 12:58:51 2004
@@ -313,7 +313,7 @@
* Enables a low level driver to set an upper limit on the size of
* received requests.
**/
-void blk_queue_max_sectors(request_queue_t *q, unsigned short max_sectors)
+void blk_queue_max_sectors(request_queue_t *q, unsigned int max_sectors)
{
if ((max_sectors << 9) < PAGE_CACHE_SIZE) {
max_sectors = 1 << (PAGE_CACHE_SHIFT - 9);
@@ -321,6 +321,14 @@
}

q->max_sectors = max_sectors;
+
+ /*
+ * use a max of 1MB as a starting point
+ */
+ q->optimal_sectors[READ] = max_sectors;
+ if (q->optimal_sectors[READ] > 2048)
+ q->optimal_sectors[READ] = 2048;
+ q->optimal_sectors[WRITE] = q->optimal_sectors[READ];
}

EXPORT_SYMBOL(blk_queue_max_sectors);
@@ -1018,10 +1026,12 @@
return 1;
}

-static int ll_back_merge_fn(request_queue_t *q, struct request *req,
+static int ll_back_merge_fn(request_queue_t *q, struct request *req,
struct bio *bio)
{
- if (req->nr_sectors + bio_sectors(bio) > q->max_sectors) {
+ const int rw = rq_data_dir(req);
+
+ if (req->nr_sectors + bio_sectors(bio) > q->optimal_sectors[rw]) {
req->flags |= REQ_NOMERGE;
q->last_merge = NULL;
return 0;
@@ -1033,10 +1043,12 @@
return ll_new_hw_segment(q, req, bio);
}

-static int ll_front_merge_fn(request_queue_t *q, struct request *req,
+static int ll_front_merge_fn(request_queue_t *q, struct request *req,
struct bio *bio)
{
- if (req->nr_sectors + bio_sectors(bio) > q->max_sectors) {
+ const int rw = rq_data_dir(req);
+
+ if (req->nr_sectors + bio_sectors(bio) > q->optimal_sectors[rw]) {
req->flags |= REQ_NOMERGE;
q->last_merge = NULL;
return 0;
@@ -1053,6 +1065,7 @@
{
int total_phys_segments = req->nr_phys_segments +next->nr_phys_segments;
int total_hw_segments = req->nr_hw_segments + next->nr_hw_segments;
+ const int rw = rq_data_dir(req);

/*
* First check if the either of the requests are re-queued
@@ -1064,7 +1077,7 @@
/*
* Will it become to large?
*/
- if ((req->nr_sectors + next->nr_sectors) > q->max_sectors)
+ if ((req->nr_sectors + next->nr_sectors) > q->optimal_sectors[rw])
return 0;

total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
@@ -2312,9 +2325,9 @@
__blk_put_request(q, freereq);

if (blk_queue_plugged(q)) {
- int nr_queued = q->rq.count[READ] + q->rq.count[WRITE];
+ int nrq = q->rq.count[READ] + q->rq.count[WRITE] - q->in_flight;

- if (nr_queued == q->unplug_thresh)
+ if (nrq == q->unplug_thresh)
__generic_unplug_device(q);
}
spin_unlock_irq(q->queue_lock);
@@ -2575,6 +2588,74 @@
rq->nr_hw_segments = nr_hw_segs;
}

+#define BLK_IORATE_SAMPLES (512)
+#define BLK_IORATE_ALIGNMASK (7) /* 4kb alignment */
+#define BLK_IORATE_TARGET (50) /* 50ms latency */
+#define BLK_IORATE_ALIGN(x) (x) = (((x) + BLK_IORATE_ALIGNMASK) & ~BLK_IORATE_ALIGNMASK)
+
+static inline void blk_calc_stream_rate(struct request *rq, int nbytes)
+{
+ request_queue_t *q = rq->q;
+ unsigned int iorate, max_sectors;
+ unsigned long duration;
+ struct blk_iorate *ior;
+ int rw;
+
+ if (!blk_fs_request(rq))
+ return;
+ if (unlikely(!q))
+ return;
+
+ rw = rq_data_dir(rq);
+ ior = &q->iorates[rw];
+ duration = jiffies - rq->issue_time;
+
+ ior->io_rate_samples++;
+ if (!(ior->io_rate_samples & (BLK_IORATE_SAMPLES - 1))) {
+ ior->io_rate_samples = 0;
+ ior->best_io_rate = 0;
+ }
+
+ duration = max(jiffies - rq->start_time, 1UL);
+ iorate = nbytes / duration;
+ if (iorate > ior->best_io_rate)
+ ior->best_io_rate = iorate;
+
+ /*
+ * average old optimal sectors with best data rate
+ */
+ if (!ior->io_rate_samples) {
+ struct backing_dev_info *bdi = &q->backing_dev_info;
+ int ra_sec = bdi->ra_pages << (PAGE_CACHE_SHIFT - 9);
+ int ra_sec_max = VM_MAX_READAHEAD << 9;
+
+ max_sectors = (ior->best_io_rate * BLK_IORATE_TARGET) >> 9;
+ q->optimal_sectors[rw] = (q->optimal_sectors[rw] + max_sectors) / 2;
+ BLK_IORATE_ALIGN(q->optimal_sectors[rw]);
+
+ /*
+ * don't exceed the hardware limit
+ */
+ if (q->optimal_sectors[rw] > q->max_sectors)
+ q->optimal_sectors[rw] = q->max_sectors;
+
+ /*
+ * let read-ahead be optimal io size aligned, and no more than
+ * a single request
+ */
+ if (rw == READ) {
+ if (ra_sec > q->optimal_sectors[READ])
+ ra_sec = q->optimal_sectors[READ];
+ if (ra_sec > ra_sec_max)
+ ra_sec = ra_sec_max;
+
+ bdi->ra_pages = ra_sec >> (PAGE_CACHE_SHIFT - 9);
+ }
+
+ printk("%s: %s optimal sectors %u (ra %lu)\n", rq->rq_disk->disk_name, rw == WRITE ? "WRITE" : "READ", q->optimal_sectors[rw], bdi->ra_pages);
+ }
+}
+
void blk_recalc_rq_sectors(struct request *rq, int nsect)
{
if (blk_fs_request(rq)) {
@@ -2628,7 +2709,8 @@
printk("end_request: I/O error, dev %s, sector %llu\n",
req->rq_disk ? req->rq_disk->disk_name : "?",
(unsigned long long)req->sector);
- }
+ } else
+ blk_calc_stream_rate(req, nr_bytes);

total_bytes = bio_nbytes = 0;
while ((bio = req->bio)) {
===== include/linux/blkdev.h 1.138 vs edited =====
--- 1.138/include/linux/blkdev.h Fri Mar 12 10:33:07 2004
+++ edited/include/linux/blkdev.h Tue Mar 30 13:00:51 2004
@@ -122,6 +122,7 @@
struct gendisk *rq_disk;
int errors;
unsigned long start_time;
+ unsigned long issue_time;

/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
@@ -267,6 +268,11 @@
atomic_t refcnt; /* map can be shared */
};

+struct blk_iorate {
+ unsigned int io_rate_samples;
+ unsigned int best_io_rate;
+};
+
struct request_queue
{
/*
@@ -337,11 +343,12 @@
*/
unsigned long nr_requests; /* Max # of requests */

- unsigned short max_sectors;
unsigned short max_phys_segments;
unsigned short max_hw_segments;
unsigned short hardsect_size;
unsigned int max_segment_size;
+ unsigned int max_sectors;
+ unsigned int optimal_sectors[2];

unsigned long seg_boundary_mask;
unsigned int dma_alignment;
@@ -350,6 +357,10 @@

atomic_t refcnt;

+ unsigned short in_flight;
+
+ struct blk_iorate iorates[2];
+
/*
* sg stuff
*/
@@ -378,6 +389,9 @@
#define blk_fs_request(rq) ((rq)->flags & REQ_CMD)
#define blk_pc_request(rq) ((rq)->flags & REQ_BLOCK_PC)
#define blk_noretry_request(rq) ((rq)->flags & REQ_FAILFAST)
+#define blk_rq_started(rq) ((rq)->flags & REQ_STARTED)
+
+#define blk_account_rq(rq) (blk_rq_started(rq) && blk_fs_request(rq))

#define blk_pm_suspend_request(rq) ((rq)->flags & REQ_PM_SUSPEND)
#define blk_pm_resume_request(rq) ((rq)->flags & REQ_PM_RESUME)
@@ -558,7 +572,7 @@
extern void blk_cleanup_queue(request_queue_t *);
extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
extern void blk_queue_bounce_limit(request_queue_t *, u64);
-extern void blk_queue_max_sectors(request_queue_t *, unsigned short);
+extern void blk_queue_max_sectors(request_queue_t *, unsigned int);
extern void blk_queue_max_phys_segments(request_queue_t *, unsigned short);
extern void blk_queue_max_hw_segments(request_queue_t *, unsigned short);
extern void blk_queue_max_segment_size(request_queue_t *, unsigned int);

--
Jens Axboe

2004-03-30 11:34:05

by Kurt Garloff

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Mon, Mar 29, 2004 at 03:08:50PM +0200, Jens Axboe wrote:
> Indeed, it would be best to keep the read-ahead window at least a
> multiple of the max read size. So you don't get the nasty effects of
> having a 128k read-ahead window, but device with 255 sector limit
> resulting in 124KB + 4KB read.

Any work underway to implement this?

Regards,
--
Kurt Garloff <[email protected]> Cologne, DE
SUSE LINUX AG, Nuernberg, DE SUSE Labs (Head)



2004-03-30 11:40:43

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Tue, Mar 30 2004, Kurt Garloff wrote:
> On Mon, Mar 29, 2004 at 03:08:50PM +0200, Jens Axboe wrote:
> > Indeed, it would be best to keep the read-ahead window at least a
> > multiple of the max read size. So you don't get the nasty effects of
> > having a 128k read-ahead window, but device with 255 sector limit
> > resulting in 124KB + 4KB read.
>
> Any work underway to implement this?

It would be a few lines of code to implement this - see the other patch
I just sent. You just need to adjust backing_dev_info.ra_pages based on
->max_sectors, if you use the current kernels.
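
Something along these lines, untested and with a made-up helper name,
just to show the shape of it -- round the read-ahead window down to a
multiple of the device's max request size, so a 128k window on a
255-sector device doesn't degenerate into a 124KB + 4KB split:

static void blk_fixup_ra_pages(request_queue_t *q)
{
        struct backing_dev_info *bdi = &q->backing_dev_info;
        unsigned long max_ra_pages = q->max_sectors >> (PAGE_CACHE_SHIFT - 9);

        /* only trim if the window spans more than one request */
        if (max_ra_pages && bdi->ra_pages > max_ra_pages)
                bdi->ra_pages -= bdi->ra_pages % max_ra_pages;
}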

--
Jens Axboe

2004-03-30 12:20:38

by Marc Bevand

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:
>
> [...]
> With this simple patch, the max request size goes from 128K to 32MB...
> so you can imagine this will definitely help performance. Throughput
> goes up. Interrupts go down. Fun for the whole family.
> [...]

I have experienced a noticeable improvement concerning the CPU usage
and disk throughput with this patch.

Benchmark specs:

o read from only 1 disk (sda), or from 2 disks (sda+sdb), with
1 or 2 instances of "dd if=/dev/sd? of=/dev/null bs=100M".
o hardware: two Seagate 160GB SATA, on a Silicon Image 3114, on a
32-bit/33MHz PCI bus, 1GB RAM.
o software: kernel 2.6.5-rc2-bk6-libata2.

Benchmark data:

        without the speed-up-sata patch     with the speed-up-sata patch
        reading sda   reading sda+sdb       reading sda   reading sda+sdb
  bi      57000           92000               57000           97000
  in       1900            2400                1600            1800
  cs       1800            3300                1400            1700
  sy        11%             20%                  9%             16%
  us         0%              0%                  0%              0%

("bi, in, cs, sy, us" have been reported by vmstat(8))

When reading only from sda, the speed-up-sata patch makes the number of
interrupts/s drop from 1900 to 1600 (CPU usage: 11% to 9%). The throughput
does not improve because 57000 blocks/s is the physical limit of the hard disk.

When reading from both sda and sdb, the improvement is more visible: the
number of interrupts/s goes from 2400 to 1800 (CPU usage: 20% to 16%). But
in this case, the throughput improves from 92000 blocks/s to 97000 blocks/s.
I think I am reaching the physical limit of the PCI bus (theoretically it
would be 133 MB/s or 133000 blocks/s). When setting the PCI latency timer of
the SiI3114 controller to 240 (was 64), I am able to reach 100000 blocks/s.

As other people were complaining that the 32MB max request size might be too
high, I did give a try to 1MB (by replacing "65534" by "2046" in the patch).
There are no visible differences between 32MB and 1MB.

PS: Jeff: "pci_dma_mapping_error()", in libata-core.c from your latest
2.6-libata patch, is an unresolved symbol. I have had to comment it out
to be able to compile the kernel.

--
Marc Bevand

2004-03-30 13:07:12

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Tue, Mar 30 2004, Marc Bevand wrote:
> Jeff Garzik wrote:
> >
> >[...]
> >With this simple patch, the max request size goes from 128K to 32MB...
> >so you can imagine this will definitely help performance. Throughput
> >goes up. Interrupts go down. Fun for the whole family.
> >[...]
>
> I have experienced a noticeable improvement concerning the CPU usage
> and disk throughput with this patch.
>
> Benchmark specs:
>
> o read from only 1 disk (sda), or from 2 disks (sda+sdb), with
> 1 or 2 instances of "dd if=/dev/sd? of=/dev/null bs=100M".
> o hardware: two Seagate 160GB SATA, on a Silicon Image 3114, on a
> 32-bit/33MHz PCI bus, 1GB RAM.
> o software: kernel 2.6.5-rc2-bk6-libata2.
>
> Benchmark data:
>
>         without the speed-up-sata patch     with the speed-up-sata patch
>         reading sda   reading sda+sdb       reading sda   reading sda+sdb
>   bi      57000           92000               57000           97000
>   in       1900            2400                1600            1800
>   cs       1800            3300                1400            1700
>   sy        11%             20%                  9%             16%
>   us         0%              0%                  0%              0%
>
> ("bi, in, cs, sy, us" have been reported by vmstat(8))
>
> When reading only from sda, the speed-up-sata patch makes the number of
> interrupts/s drop from 1900 to 1600 (CPU usage: 11% to 9%). The throughput
> does not improve because 57000 blocks/s is the physical limit of the
> hardisk.
>
> When reading from both sda and sdb, the improvement is more visible: the
> number of interrupts/s goes from 2400 to 1800 (CPU usage: 20% to 16%). But
> in this case, the throughput improves from 92000 blocks/s to 97000 blocks/s.
> I think I am reaching the physical limit of the PCI bus (theoretically it
> would be 133 MB/s or 133000 blocks/s). When setting the PCI latency timer of
> the SiI3114 controller to 240 (was 64), I am able to reach 100000 blocks/s.

Good that somebody did some testing on this, thanks :-)

> As other people were complaining that the 32MB max request size might be too
> high, I did give a try to 1MB (by replacing "65534" by "2046" in the patch).
> There is no visible differences between 32MB and 1MB.

As suspected. BTW, you want to use 2048 there, not 2046. The 64K-2
(which could be 64K-1) is just due to ->max_sectors being an unsigned
short currently.
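
For reference, the numbers being juggled here -- just arithmetic:

#include <stdio.h>

int main(void)
{
        printf("65534 sectors * 512 = %u bytes (~32 MB)\n", 65534u * 512u);
        printf(" 2048 sectors * 512 = %u bytes (1 MB)\n", 2048u * 512u);
        /* an unsigned short tops out at 65535, hence the cap just under 64K sectors */
        return 0;
}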

--
Jens Axboe

2004-03-30 13:49:52

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Tue, Mar 30 2004, Marc Bevand wrote:
> On 30 Mar 2004, Jens Axboe wrote:
> | On Tue, Mar 30 2004, Marc Bevand wrote:
> | [...]
> | > As other people were complaining that the 32MB max request size might be too
> | > high, I did give a try to 1MB (by replacing "65534" by "2046" in the patch).
> | > There is no visible differences between 32MB and 1MB.
> |
> | As suspected. BTW, you want to use 2048 there, not 2046. The 64K-2
> | (which could be 64K-1) is just due to ->max_sectors being an unsigned
> | short currently.
>
> Okay, I thought the 2 sectors were used to store some extra
> information.
>
> I also forgot to mention that there do not seem to be any
> *latency* differences between 32MB and 1MB. Strange...

That's because you are not generating requests that big anyways.

--
Jens Axboe

2004-03-30 13:48:23

by Marc Bevand

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On 30 Mar 2004, Jens Axboe wrote:
| On Tue, Mar 30 2004, Marc Bevand wrote:
| [...]
| > As other people were complaining that the 32MB max request size might be too
| > high, I did give a try to 1MB (by replacing "65534" by "2046" in the patch).
| > There is no visible differences between 32MB and 1MB.
|
| As suspected. BTW, you want to use 2048 there, not 2046. The 64K-2
| (which could be 64K-1) is just due to ->max_sectors being an unsigned
| short currently.

Okay, I thought the 2 sectors were used to store some extra information.

I also forgot to mention that there do not seem to be any *latency*
differences between 32MB and 1MB. Strange...

--
Marc Bevand http://www.epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.

2004-03-30 15:32:11

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Marc Bevand wrote:
> Jeff Garzik wrote:
>
>>
>> [...]
>> With this simple patch, the max request size goes from 128K to 32MB...
>> so you can imagine this will definitely help performance. Throughput
>> goes up. Interrupts go down. Fun for the whole family.
>> [...]
>
>
> I have experienced a noticeable improvement concerning the CPU usage
> and disk throughput with this patch.
>
> Benchmark specs:
>
> o read from only 1 disk (sda), or from 2 disks (sda+sdb), with
> 1 or 2 instances of "dd if=/dev/sd? of=/dev/null bs=100M".
> o hardware: two Seagate 160GB SATA, on a Silicon Image 3114, on a
> 32-bit/33MHz PCI bus, 1GB RAM.
> o software: kernel 2.6.5-rc2-bk6-libata2.

[...]

Very cool, thanks for benching!


> As other people were complaining that the 32MB max request size might be
> too
> high, I did give a try to 1MB (by replacing "65534" by "2046" in the
> patch).
> There is no visible differences between 32MB and 1MB.

This is not surprising, since:
* the scatter-gather table imposes a limit of 8MB
* the VM further imposes limits on readahead and writeback

So 32MB is the hardware max, but not really achievable due to other
factors.
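
Where the 8MB figure comes from, assuming a LIBATA_MAX_PRD of 128 entries
and the usual 64KB-per-PRD-entry limit:

#include <stdio.h>

int main(void)
{
        unsigned int prd_entries = 128;            /* LIBATA_MAX_PRD */
        unsigned int bytes_per_entry = 64 * 1024;  /* one PRD entry addresses at most 64KB */

        printf("S/G limit = %u MB\n",
               prd_entries * bytes_per_entry / (1024 * 1024));
        return 0;
}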


> PS: Jeff: "pci_dma_mapping_error()", in libata-core.c from your latest
> 2.6-libata patch, is an unresolved symbol. I have had to comment it out
> to be able to compile the kernel.

Yeah, this is only found in the bleeding-edge-latest 2.6.x kernels. You
did the right thing...

Regards,

Jeff



2004-03-30 15:36:04

by Timothy Miller

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA



Jens Axboe wrote:

>
>
> Here's a quickly done patch that attempts to adjust the value based on a
> previous range of completed requests. It changes ->max_sectors to be a
> hardware limit, adding ->optimal_sectors to be our max issued io target.
> It is split on READ and WRITE. The target is to keep request execution
> time under BLK_IORATE_TARGET, which is 50ms in this patch. read-ahead
> max window is kept within a single request in size.
>
[snip]

Awesome patch!

One question I have is about how you are determining optimal request
size. You are basing this on the _maximum_ performance of the device,
but there could be huge differences between min and max performance.
Would it be better to compute optimal based on worst-case performance?
Or averaged over time?

Ideally, we'd have a running average where one hit of worst-case will
have an impact, but if that happens only once, it'll drop out of the
average, and best-case events will be considered but tempered as well.
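
A sketch of that kind of running average -- an exponentially weighted
moving average, not anything in Jens' patch, with illustrative weights:

static unsigned int avg_io_rate;        /* bytes per jiffy, smoothed */

static void update_io_rate(unsigned int sample)
{
        if (!avg_io_rate)
                avg_io_rate = sample;
        else    /* weight: 7/8 history, 1/8 new sample */
                avg_io_rate = (avg_io_rate * 7 + sample) / 8;
}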

If we're going for real-time, then we want to avoid any latency over
50ms. In that case, we'd want to go based on worst-case, although it
would still be good to keep, say, an occasional slow-responding hardware
glitch from having a permanent impact.

2004-03-30 16:20:31

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jens Axboe wrote:
> Here's a quickly done patch that attempts to adjust the value based on a
> previous range of completed requests. It changes ->max_sectors to be a
> hardware limit, adding ->optimal_sectors to be our max issued io target.
> It is split on READ and WRITE. The target is to keep request execution
> time under BLK_IORATE_TARGET, which is 50ms in this patch. read-ahead
> max window is kept within a single request in size.
>
> So this is pretty half-assed, but it gets the point across. Things that
> should be looked at (meaning - should be done, but I didn't want to
> waste time on them now):
>
> - I added the hook to see how many in-drive queued requests you have, so
> there's the possibility to add tcq knowledge as well.
>
> - The optimal_sectors calculation is just an average of previous maximum
> with current maximum. The scheme has a number of break down points,
> for instance writes starving reads will give you a bad read execution
> time, further limiting ->optimal_sectors[READ]


Makes me pretty happy... :)

Jeff



2004-03-30 17:43:10

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Marc Bevand wrote:
> I think I am reaching the physical limit of the PCI bus (theoretically it
> would be 133 MB/s or 133000 blocks/s). When setting the PCI latency
> timer of
> the SiI3114 controller to 240 (was 64), I am able to reach 100000 blocks/s.

That's interesting.

I wonder if we should look at making pci_set_master()'s latency timer
setting code be a bit smarter.

It (pcibios_set_master in arch/i386/pci/i386.c) currently checks the
latency timer value programmed by the BIOS. If the BIOS did not
initialize the value, then it is set to 64. Otherwise, it is clamped to
the maximum 255.
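
Roughly, the logic being described looks like this -- a simplified
paraphrase with a made-up function name, not the literal arch code:

static void set_master_latency(struct pci_dev *dev)
{
        unsigned int max_lat = 255;     /* clamp ceiling; could be made tunable */
        u8 lat;

        pci_read_config_byte(dev, PCI_LATENCY_TIMER, &lat);

        if (lat < 16)
                lat = 64;               /* BIOS left it uninitialized */
        else if (lat > max_lat)
                lat = max_lat;          /* clamp to the ceiling */
        else
                return;                 /* keep the BIOS value */

        pci_write_config_byte(dev, PCI_LATENCY_TIMER, lat);
}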

I wonder if your BIOS shouldn't increase that latency timer value...?

Jeff



2004-03-30 17:48:03

by Timothy Miller

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA



Jens Axboe wrote:

> - The optimal_sectors calculation is just an average of previous maximum
> with current maximum. The scheme has a number of break down points,
> for instance writes starving reads will give you a bad read execution
> time, further limiting ->optimal_sectors[READ]


I did look through your code a bit, but one thing to be concerned with
is that going only on max throughput might be fooled by cache hits on
the drive.

2004-03-30 17:52:06

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Timothy Miller wrote:
>
>
> Jens Axboe wrote:
>
>> - The optimal_sectors calculation is just an average of previous maximum
>> with current maximum. The scheme has a number of break down points,
>> for instance writes starving reads will give you a bad read execution
>> time, further limiting ->optimal_sectors[READ]
>
>
>
> I did look through your code a bit, but one thing to be concerned with
> is that going only on max throughput might be fooled by cache hits on
> the drive.


If you are taking your samples over time, that shouldn't matter... if
the system workload is such that you are hitting the drive cache the
majority of the time, you're not being "fooled" by cache hits, the patch
would be taking those cache hits into account.

If the system isn't hitting the drive cache the majority of the time,
statistical sampling will automatically notice that too...

Jeff



2004-03-30 18:00:42

by Timothy Miller

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA



Jeff Garzik wrote:
>
>
>
> If you are taking your samples over time, that shouldn't matter... if
> the system workload is such that you are hitting the drive cache the
> majority of the time, you're not being "fooled" by cache hits, the patch
> would be taking those cache hits into account.
>
> If the system isn't hitting the drive cache the majority of the time,
> statistical sampling will automatically notice that too...
>


I completely agree, although Jens' patch seems to try to learn the
drive's maximum speed and go based on that. Maybe I misread the code.
Anyhow, it's certainly excellent as a starting point... it's this sort
of proof-of-concept that gets the ball rolling. Plus, it's already
better than Jens says it is. :)


2004-03-31 05:46:52

by Marcus Hartig

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Hallo,

I did not get it to work with my Maxtor SATA 80GB 6Y080M0.

With 2.6.5-rc2/3 or mm1 there is no chance to get the higher request
size. Is this not possible with the smaller drives and libata? I'm not
sure if this disk has the lba48 feature.

An Abit NF7-S with a Silicon Image 3112a is running here under Fedora.


please cc me,

Marcus

2004-03-31 06:56:37

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Marcus Hartig wrote:
> Hallo,
>
> I did not get it work with my Maxtor SATA 80GB 6Y080M0.

Can you post the relevant 'dmesg' output?


> With 2.6.5-rc2/3 or mm1 no chance to get the higher request size. Is
> this not possible with the the smaller drives and libata? Im not
> sure if this disk has the the lba48 feature?

If it does not have lba48, then the patch should print out "request size
128K" or similar.


> Abit NF7-S with an SilImage 3112a is here under Fedora
> running.

That should be fine with lba48...

Jeff



2004-03-31 09:13:12

by Marc Bevand

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On 30 Mar 2004, Jeff Garzik wrote:
| Marc Bevand wrote:
| >I think I am reaching the physical limit of the PCI bus (theoretically it
| >would be 133 MB/s or 133000 blocks/s). When setting the PCI latency
| >timer of
| >the SiI3114 controller to 240 (was 64), I am able to reach 100000 blocks/s.
|
| That's interesting.
|
| I wonder if we should look at making pci_set_master()'s latency timer
| setting code be a bit smarter.

AFAIK choosing the optimal latency timer for each device on a PCI bus is
not a trivial thing; one needs to take into account a lot of things.
But making it a "bit smarter" would be "good enough".

| It (pcibios_set_master in arch/i386/pci/i386.c) current checks the

Actually my arch is x86_64 ;-) But I guess the code is very similar.

| latency timer value programmed by the BIOS. If the BIOS did not
| initialize the value, then it is set to 64. Otherwise, it is clamped to
| the maximum 255.
|
| I wonder if your BIOS shouldn't increase that latency timer value...?

My BIOS seems to always initialize the latency timer. There is a menu
in which one can choose the value (32, 64, 96, etc), and the default
setting (when "loading failsafe settings" or "loading optimized
settings") is 64 (that is where the value is coming from). The BIOS
does not offer an "auto" value that would be computed dynamically for
optimal performance.

--
Marc Bevand http://www.epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.

2004-03-31 16:16:45

by Marcus Hartig

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jeff Garzik wrote:

> Can you post the relevant 'dmesg' output?

Linux version 2.6.5-rc3-mm1 (cowboy@redtuxi) (gcc-Version 3.3.3 20040311
(Red Hat Linux 3.3.3-3))
...
Kernel command line: ro root=/dev/sda1 doataraid noraid acpi=off rhgb
...
libata version 1.02 loaded.
sata_sil version 0.54
ata1: SATA max UDMA/100 cmd 0xE0800080 ctl 0xE080008A bmdma 0xE0800000 irq 4
ata2: SATA max UDMA/100 cmd 0xE08000C0 ctl 0xE08000CA bmdma 0xE0800008 irq 4
ata1: dev 0 cfg 49:2f00 82:7c6b 83:7b09 84:4003 85:7c69 86:3a01 87:4003
88:207f
ata1: dev 0 ATA, max UDMA/133, 160086528 sectors
ata1: dev 0 configured for UDMA/100
scsi0 : sata_sil
ata2: no device found (phy stat 00000000)
ata2: thread exiting
scsi1 : sata_sil
Vendor: ATA Model: Maxtor 6Y080M0 Rev: 1.02
Type: Direct-Access ANSI SCSI revision: 05
ata1: dev 0 max request 128K
SCSI device sda: 160086528 512-byte hdwr sectors (81964 MB)
SCSI device sda: drive cache: write through
sda: sda1 sda2 sda3 sda4 < sda5 >
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
------------------%<------------------------------------

Also, my Maxtor is now configured for UDMA/100. With earlier versions of
libata/sata_sil it was UDMA/133 without any problems. The performance has
now dropped a little bit, from ~52MB/s to ~48MB/s with hdparm.
Best regards,

Marcus

2004-04-02 10:13:22

by Jeremy Higdon

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Sun, Mar 28, 2004 at 06:36:23PM +0100, Jamie Lokier wrote:
>
> This is what I mean: turn off write cacheing, and performance on PATA
> drops because of the serialisation and lost inter-command time.

Since you have to write the sectors in order (well, you don't have
to, but the drives all do this), you lose a rev between each write
when you don't queue commands or have write cacheing.
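
Back of the envelope, with 7200 rpm as an illustrative figure:

#include <stdio.h>

int main(void)
{
        double rpm = 7200.0;
        double rev_ms = 60.0 * 1000.0 / rpm;    /* one revolution, in milliseconds */

        printf("one lost revolution at %.0f rpm = %.2f ms per write,\n", rpm, rev_ms);
        printf("so at most ~%.0f small synchronous writes per second\n",
               1000.0 / rev_ms);
        return 0;
}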

> With TCQ-on-write, you can turn off write cacheing and in theory
> performance doesn't have to drop, is that right?

Correct. I have proven this to my satisfaction.


Regarding request sizes, the main benefit of very large (> 1MB)
request sizes for disk drives is the elimination of the lost revolution
when write cacheing is disabled and you can't queue back-to-
back writes.

For hardware RAID, they frequently need 4MB or more to hit
peak performance.


I agree that we don't want a host driver to arbitrarily limit
I/O size, because it's difficult to predict what may be connected
to it. It may be necessary due to deficiencies elsewhere, but
not desired.

jeremy

2004-04-02 16:16:54

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jeremy Higdon wrote:
> > This is what I mean: turn off write cacheing, and performance on PATA
> > drops because of the serialisation and lost inter-command time.
>
> Since you have to write the sectors in order (well, you don't have
> to, but the drives all do this), you lose a rev between each write
> when you don't queue commands or have write cacheing.

I don't see how the driver can write the sectors out of order, if
there is no TCQ (we're talking PATA) and every write must be committed
before it's acknowledged (write cache disabled).

> > With TCQ-on-write, you can turn off write cacheing and in theory
> > performance doesn't have to drop, is that right?
>
> Correct. I have proven this to my satisfaction.

Are you refuting the following assertion by Eric D. Mudama, based on
your measurements? In other words, are ATA's 32 TCQ slots enough to
eliminate the performance advantage of write cacheing?

Eric D. Mudama <[email protected]> wrote:
> However, cached writes (queued or unqueued), especially small ones,
> will have WAY higher ops/sec. ATA TCQ is limited to 32 choices for
> the next best operation, but an 8MB buffer doing 4K random-ops could
> potentially have ~2000 choices for the next thing to do. (assuming
> perfect cache granularity, etc, etc)
>
> At 32 choices, the seek and rotate are still somewhat significant,
> though way better than unqueued. With 2000 things to choose from, the
> drive is never, ever idle. Seek, land just-in-time, read/write,
> rinse/repeat.

Thanks,
-- Jamie

2004-04-03 10:50:26

by Jeremy Higdon

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

On Fri, Apr 02, 2004 at 05:11:49PM +0100, Jamie Lokier wrote:
> Jeremy Higdon wrote:
> > > This is what I mean: turn off write cacheing, and performance on PATA
> > > drops because of the serialisation and lost inter-command time.
> >
> > Since you have to write the sectors in order (well, you don't have
> > to, but the drives all do this), you lose a rev between each write
> > when you don't queue commands or have write cacheing.
>
> I don't see how the driver can write the sectors out of order, if
> there is no TCQ (we're talking PATA) and every write must be committed
> before it's acknowledged (write cache disabled).
>
> > > With TCQ-on-write, you can turn off write cacheing and in theory
> > > performance doesn't have to drop, is that right?
> >
> > Correct. I have proven this to my satisfaction.
>
> Are you refuting the following assertion by Eric D. Mudama's, based on
> your measurements? In other words, are ATA's 32 TCQ slots enough to
> eliminate the performance advantage of write cacheing?

I must apologize. I had thought the context was SCSI, but now I
see that it is linux-ide. So please disregard comments about command
queueing. If you have write cache disabled and no TCQ (latter is common,
former may or may not be), you want to write as big a chunk as you can.

I'm sorry about the confusion.

jeremy

2004-04-03 13:49:44

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] speed up SATA

Jeremy Higdon wrote:
> > Are you refuting the following assertion by Eric D. Mudama's, based on
> > your measurements? In other words, are ATA's 32 TCQ slots enough to
> > eliminate the performance advantage of write cacheing?
>
> I must apologize. I had thought the context was SCSI, but now I
> see that it is linux-ide. So please disregard comments about command
> queueing.

We're talking about _with_ TCQ, on serial ATA (SATA) where TCQ is common,
and deliberately disabling the write cache so that TCQ completions
indicate when the data is written.

If your measurements indicate that 32 queue slots is adequate with
SCSI drives to eliminate the overhead of disabling write cacheing,
then that's valuable information. The drives are basically the same
after all.

> If you have write cache disabled and no TCQ (latter is common,
> former may or may not be), you want to write as big a chunk as you can.

That's a handy insight too. Personally I hadn't thought about large
requests reducing the rotation latency with write cache disabled.

-- Jamie