Just to echo some comments I said in private, this driver is _not_
a replacement for drivers/ide. This is not, and has never been,
the intention. In fact, I need drivers/ide's continued existence,
so that I may have fewer boundaries on future development.
Even though ATAPI support doesn't exist and error handling is
primitive, this driver has been extensively tested locally and I feel
is ready for a full and public kernel developer assault :)
James ok'd sending this... I'll be sending "un-hack scsi headers" patch
through him via his scsi-misc-2.5 tree.
Linus, please do a
bk pull bk://kernel.bkbits.net/jgarzik/scsi-2.5
Others may download the patch from
ftp://ftp.kernel.org/pub/linux/kernel/people/jgarzik/patchkits/2.5/2.5.69-bk18-scsi1.patch.bz2
This will update the following files:
drivers/scsi/Kconfig | 27
drivers/scsi/Makefile | 1
drivers/scsi/ata_piix.c | 322 ++++++
drivers/scsi/libata.c | 2247 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ata.h | 485 ++++++++++
5 files changed, 3082 insertions(+)
through these ChangeSets:
<[email protected]> (03/05/26 1.1357)
[scsi ata] make PATA config option actually do something useful
<[email protected]> (03/05/26 1.1356)
[scsi ata] include hacks, b/c scsi headers not in include/linux
<[email protected]> (03/05/26 1.1355)
[scsi] add ATA driver
On Mon, 26 May 2003, Jeff Garzik wrote:
>
> Just to echo some comments I said in private, this driver is _not_
> a replacement for drivers/ide. This is not, and has never been,
> the intention. In fact, I need drivers/ide's continued existence,
> so that I may have fewer boundaries on future development.
Just out of interest, is there any _point_ to this driver? I can
appreciate the approach, but I'd like to know if it does anything (at all)
better than the native IDE driver? Faster? Anything?
Linus
Linus Torvalds wrote:
> On Mon, 26 May 2003, Jeff Garzik wrote:
>
>>Just to echo some comments I said in private, this driver is _not_
>>a replacement for drivers/ide. This is not, and has never been,
>>the intention. In fact, I need drivers/ide's continued existence,
>>so that I may have fewer boundaries on future development.
>
>
> Just out of interest, is there any _point_ to this driver? I can
> appreciate the approach, but I'd like to know if it does anything (at all)
> better than the native IDE driver? Faster? Anything?
Direction: SATA is much more suited to SCSI, because otherwise you wind
up re-creating all the queueing and error handling mess that SCSI
already does for you. The SATA2 host controllers coming out soon do
full host-side TCQ, not the dain-bramaged ATA TCQ bus-release stuff.
Doing SATA2 devel in drivers/ide will essentially be re-creating the
SCSI mid-layer.
Modularity: drivers/ide has come a long way. It needed to be turned
"inside out", and that's what Alan did. But there's still a lot of code
that needs to be factored out/about, before hotplugging and device model
stuff is sane.
Legacy-free: Because I don't have to worry about legacy host
controllers, I can ignore limitations drivers/ide cannot. In
drivers/ide, each host IO (PIO/MMIO) is done via function pointer. If
your arch has a mach_vec, more function pointers. Mine does direct
calls to the asm/io.h functions instead. So, ATA command submission
is measurably faster.
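To make the comparison concrete, here is a rough sketch of the two submission styles; the structures and helper names are illustrative only, not the actual drivers/ide or libata code:

#include <asm/io.h>

/* Style A: every register write goes through a per-host function
 * pointer, roughly the way drivers/ide routes I/O through hwif ops. */
struct demo_ioops {
	void (*outb)(unsigned char val, unsigned long port);
};

static void demo_submit_indirect(struct demo_ioops *ops, unsigned long base,
				 const unsigned char *tf, int n)
{
	int i;

	for (i = 0; i < n; i++)
		ops->outb(tf[i], base + i);	/* indirect call per register */
}

/* Style B: a legacy-free driver that knows its hardware is plain port
 * I/O can call the asm/io.h accessors directly, so the compiler can
 * inline them and no per-register indirect branch is taken. */
static void demo_submit_direct(unsigned long base,
			       const unsigned char *tf, int n)
{
	int i;

	for (i = 0; i < n; i++)
		outb(tf[i], base + i);		/* direct, inlinable */
}

The function-pointer style is what lets one driver core serve wildly different hosts; the direct style trades that generality for the inlined fast path described above.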
sysfs: James and co are putting time into getting scsi sysfs right. I
would rather ride their coattails, and have my driver Just Work with
sysfs and the driver model.
PIO data transfer is faster and more scheduler-friendly, since it polls
from a kernel thread.
And for specifically Intel SATA, drivers/ide flat out doesn't work (even
though it claims to).
So, I conclude: faster, smaller, and better future direction. IMO, of
course :)
Jeff
(the following is somewhat comparing apples to oranges, but I like doing
it nonetheless)
bash-2.05b$ wc -l drivers/scsi/libata.c drivers/scsi/ata_piix.c include/linux/ata.h
2247 drivers/scsi/libata.c
322 drivers/scsi/ata_piix.c
485 include/linux/ata.h
3054 total
bash-2.05b$ wc -l drivers/ide/*.[ch] drivers/ide/pci/piix.[ch]
3418 drivers/ide/ide-cd.c
733 drivers/ide/ide-cd.h
71 drivers/ide/ide-default.c
1841 drivers/ide/ide-disk.c
1145 drivers/ide/ide-dma.c
2090 drivers/ide/ide-floppy.c
135 drivers/ide/ide-geometry.c
1330 drivers/ide/ide-io.c
1322 drivers/ide/ide-iops.c
437 drivers/ide/ide-lib.c
97 drivers/ide/ide-pnp.c
1466 drivers/ide/ide-probe.c
930 drivers/ide/ide-proc.c
6330 drivers/ide/ide-tape.c
2006 drivers/ide/ide-taskfile.c
798 drivers/ide/ide-tcq.c
281 drivers/ide/ide-timing.h
2539 drivers/ide/ide.c
41 drivers/ide/ide_modes.h
899 drivers/ide/setup-pci.c
840 drivers/ide/pci/piix.c
317 drivers/ide/pci/piix.h
29066 total
Jeff Garzik wrote:
> So, I conclude: faster, smaller, and better future direction. IMO, of
> course :)
As a FWIW, this driver is mainly intended as a Serial ATA driver.
It just happens to do PATA by coincidence (i.e. because it makes devel
easier for me).
For example, current Intel SATA is "PATA, but without the timing junk."
I think with SATA + drivers/ide, you reach a point of diminishing
returns versus amount of time spent on mid-layer coding.
Jeff
On Mon, 26 May 2003, Jeff Garzik wrote:
>
> Direction: SATA is much more suited to SCSI, because otherwise you wind
> up re-creating all the queueing and error handling mess that SCSI
> already does for you.
Last I looked, the SCSI interfaces were much nastier than the native
queueing, and if there is anything missing I'd rather put it at that
layer, instead of making everything use the SCSI layer.
Because when you talk about error handling messes, you're talking SCSI.
THAT is messy. At least judging by the fact that a lot of SCSI drivers
don't seem to get it right.
In other words: I'd really like to see what you can do with a _native_
request queue driver, and what the problems are. And maybe port _those_
parts of SCSI to it.
> And for specifically Intel SATA, drivers/ide flat out doesn't work (even
> though it claims to).
Well, I don't think it claimed to, until today. Still doesn't work?
Linus
On Mon, 26 May 2003, Jeff Garzik wrote:
>
> I think with SATA + drivers/ide, you reach a point of diminishing
> returns versus amount of time spent on mid-layer coding.
I think that's a valid approach, and just have a special driver for SATA.
That's not the part I worry about.
The part I worry about is the SCSI layer itself, and also potential user
confusion.
Linus
Linus Torvalds wrote:
> On Mon, 26 May 2003, Jeff Garzik wrote:
>
>>Direction: SATA is much more suited to SCSI, because otherwise you wind
>>up re-creating all the queueing and error handling mess that SCSI
>>already does for you.
>
>
> Last I looked, the SCSI interfaces were much nastier than the native
> queueing, and if there is anything missing I'd rather put it at that
> layer, instead of making everything use the SCSI layer.
The SCSI mid-layer has quite flexible command submission and
synchronization. Since each SATA host controller I've seen so far
differs in its queueing implementation and limits, this fits perfectly
with the existing SCSI setup.
So, short-term, I save a ton of code over a SATA block driver.
Long-term, the SCSI mid-layer benefits from the developer attention and
becomes more lightweight as we push some generic concepts upwards into
the block layer. Just look at how far the scsi mid-layer has come in 2.5,
versus 2.4! Much more lightweight even now.
So, short-term I disagree. Long-term, I actually agree w/ you, in an
indirect sorta way :)
> Because when you talk about error handling messes, you're talking SCSI.
> THAT is messy. At least judging by the fact that a lot of SCSI drivers
> don't seem to get it right.
This isn't an issue for me. I have to do my own error handling, in
fact. I just define ->eh_strategy_handler, and do my own thing. SCSI
mid-layer nicely provides a kernel thread and command synchronization
before and after it calls ->eh_strategy_handler.
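Roughly, that amounts to nothing more than filling in one hook in the host template and writing the handler body; a minimal sketch against the 2.5-era SCSI headers (names other than eh_strategy_handler are made up, and the usual boilerplate is omitted):

#include "scsi.h"
#include "hosts.h"

/* Sketch only.  The midlayer quiesces the host and calls this from its
 * own error-handling kernel thread, so the driver just has to reset the
 * port and finish or retry the outstanding commands. */
static int demo_sata_eh_strategy(struct Scsi_Host *host)
{
	/* ... reset controller, complete/retry failed commands ... */
	return 0;
}

static Scsi_Host_Template demo_sata_template = {
	.name			= "demo_sata",
	.eh_strategy_handler	= demo_sata_eh_strategy,
	.can_queue		= 1,
	.this_id		= -1,
	.sg_tablesize		= 128,
	/* .queuecommand, .bios_param, etc. omitted */
};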
> In other words: I'd really like to see what you can do with a _native_
> request queue driver, and what the problems are. And maybe port _those_
> parts of SCSI to it.
I considered a native block driver, or perhaps a native block driver for
SATA and SCSI pass-thru for SATAPI.
Actually getting down to coding, I see it as a huge amount of work for
little gain. You have to consider all the userspace interfaces, sysfs
and device model support that wants coding, -after- you're done with the
basic SATA block driver. Userland proggies already exist for scsi.
(more on this in next reply)
>>And for specifically Intel SATA, drivers/ide flat out doesn't work (even
>>though it claims to).
>
>
> Well, I don't think it claimed to, until today. Still doesn't work?
Nope. Not before or after. (even without the patch, it should have
worked in some amount of compat mode)
Jeff
On Mon, 2003-05-26 at 06:58, Jeff Garzik wrote:
> Just to echo some comments I said in private, this driver is _not_
> a replacement for drivers/ide. This is not, and has never been,
> the intention. In fact, I need drivers/ide's continued existence,
> so that I may have fewer boundaries on future development.
>
> Even though ATAPI support doesn't exist and error handling is
> primitive, this driver has been extensively tested locally and I feel
> is ready for a full and public kernel developer assault :)
>
> James ok'd sending this... I'll be sending "un-hack scsi headers" patch
> through him via his scsi-misc-2.5 tree.
Btw, Jeff, while I agree about not bothering about old PATA hardware,
I'd still like to see support for MDMA modes in there. For one,
there is no real difference in supporting both UDMA and MDMA, it's
really just a matter of extending the range of the "mode" parameter,
and I'd like to be able to use the driver on configs like pmacs which
typically have a U/DMA capable channel for the internal HD and one
MDMA only channel (ATAPI CD/DVD, ZIP) without having to play bad tricks
to get both drivers up.
Ben.
Linus Torvalds wrote:
> On Mon, 26 May 2003, Jeff Garzik wrote:
>
>>I think with SATA + drivers/ide, you reach a point of diminishing
>>returns versus amount of time spent on mid-layer coding.
>
>
> I think that's a valid approach, and just have a special driver for SATA.
> That's not the part I worry about.
You're still at a point of diminishing returns for a native driver too.
ATAPI support would be a pain. Can't use drivers/ide/* nor ide-scsi.
Userland support is nonexistent, and users would have to start using yet
another set of tools to deal with their disks, instead of the existing
[scsi] ones.
Device model, power management, hotplugging are all handled or
in-the-works for scsi, and they are exactly what SATA needs. I would
have to recreate all that handling from scratch.
I think you're missing how much code and pain is actually saved by using
the scsi layer... like I mentioned in the last message, long-term,
these issues can (and should) be solved by moving some of this stuff out
of the scsi layer into the block layer, creating an overall desirable
flattening effect of all code involved.
> The part I worry about is the SCSI layer itself, and also potential user
> confusion.
I think this can only benefit the scsi layer and continue its trend of
becoming more lightweight and lib'ified.
WRT user confusion, I agree: that's why PATA is a config option. I
would prefer people think "SATA? use my driver. PATA? use drivers/ide"
Andre suggested that I might support PCI PATA controllers in my driver
"officially", but I think user confusion and transition issues rule out
that, for non-hackers.
Jeff
Benjamin Herrenschmidt wrote:
> Btw, Jeff, while I agree about not bothering about old PATA hardware,
> I'd still like to see support for MDMA modes in there. For one,
> there is no real difference in supporting both UDMA and MDMA, it's
> really just a matter of extending the range of the "mode" parameter,
> and I'd like to be able to use the driver on configs like pmacs which
> typically have a U/DMA capable channel for the internal HD and one
> MDMA only channel (ATAPI CD/DVD, ZIP) without having to play bad tricks
> to get both drivers up.
We'll see... I would really like to ignore MDMA. For the reasons just
outlined in replies to Linus, PATA support is quite useful in limited
situations, but is not at all the focus of the driver. When situations
arise where special support for some PATA situation is needed, I will
most likely just say "use drivers/ide"...
Jeff
Jeff Garzik wrote:
> Linus Torvalds wrote:
>> In other words: I'd really like to see what you can do with a _native_
>> request queue driver, and what the problems are. And maybe port
>> _those_ parts of SCSI to it.
> Actually getting down to coding, I see it as a huge amount of work for
> little gain. You have to consider all the userspace interfaces, sysfs
> and device model support that wants coding, -after- you're done with the
> basic SATA block driver. Userland proggies already exist for scsi.
Tangenting a bit:
What does the block layer need, that it doesn't have now?
Each major subsystem right now is re-creating driver classes (floppy,
tape, disk, ...) and that should be moved up and made more general.
People should use /dev/diskX, /dev/floppyX, /dev/cdromX etc. without
having to worry about IDE or SCSI or Jeff's SATA or whatever.
Registration of majors/minors/CIDR/etc. sysfs. Host
controller-specific struct request_queue settings. All these have to be
recreated each time a new native block driver is created. If you peer
closely at some scsi drivers already merged, you see that they too chose to
translate SCSI rather than re-create all the junk necessary for a native
block driver. I'm definitely not the first to do what I did. :)
When such code moves up into the block layer, then we can have a unified
userland for doing class-specific activities, and then have a few
bus-specific tools for the ATA- or SCSI-specific needs not met by
class-general code.
As it stands now, I think a new native block driver creates more user
confusion and pain than it solves, in addition to saving developer pain,
if you consider all the device classes and userland tools and such that
need writing.
Jeff
On Mon, 26 May 2003, Jeff Garzik wrote:
> Linus Torvalds wrote:
> > On Mon, 26 May 2003, Jeff Garzik wrote:
> >
> >>Just to echo some comments I said in private, this driver is _not_
> >>a replacement for drivers/ide. This is not, and has never been,
> >>the intention. In fact, I need drivers/ide's continued existence,
> >>so that I may have fewer boundaries on future development.
> >
> >
> > Just out of interest, is there any _point_ to this driver? I can
> > appreciate the approach, but I'd like to know if it does anything (at all)
> > better than the native IDE driver? Faster? Anything?
>
>
> Direction: SATA is much more suited to SCSI, because otherwise you wind
> up re-creating all the queueing and error handling mess that SCSI
> already does for you. The SATA2 host controllers coming out soon do
> full host-side TCQ, not the dain-bramaged ATA TCQ bus-release stuff.
> Doing SATA2 devel in drivers/ide will essentially be re-creating the
> SCSI mid-layer.
And now you are recreating ATA in SCSI ;-).
Don't get me wrong: I like the idea very much, but why can't you
share common code between drivers/ide and your ATA-SCSI?
> Modularity: drivers/ide has come a long way. It needed to be turned
> "inside out", and that's what Alan did. But there's still a lot of code
> that needs to be factored out/about, before hotplugging and device model
> stuff is sane.
It's getting close.
> Legacy-free: Because I don't have to worry about legacy host
> controllers, I can ignore limitations drivers/ide cannot. In
> drivers/ide, each host IO (PIO/MMIO) is done via function pointer. If
> your arch has a mach_vec, more function pointers. Mine does direct
> calls to the asm/io.h functions instead. So, ATA command submission
> is measurably faster.
I think it is simply wrong; you should use function pointers.
You can have, e.g., two PCI hosts, one using PIO and one using MMIO.
"Measurably faster"? I doubt it.
IO operations are REALLY slow when compared to CPU cycles.
> sysfs: James and co are putting time into getting scsi sysfs right. I
> would rather ride their coattails, and have my driver Just Work with
> sysfs and the driver model.
No big deal here, ATA will get it too.
> PIO data transfer is faster and more scheduler-friendly, since it polls
> from a kernel thread.
CPU polling is faster than IRQs?
> And for specifically Intel SATA, drivers/ide flat out doesn't work (even
> though it claims to).
So fix it ;-).
> So, I conclude: faster, smaller, and better future direction. IMO, of
> course :)
And right now ugly and incomplete.
IMO, of course ;-).
Regards,
--
Bartlomiej
> Jeff
>
>
>
> (the following is somewhat comparing apples to oranges, but I like doing
> it nonetheless)
social engineering removed
Bartlomiej Zolnierkiewicz wrote:
> On Mon, 26 May 2003, Jeff Garzik wrote:
>
>
>>Linus Torvalds wrote:
>>
>>>On Mon, 26 May 2003, Jeff Garzik wrote:
>>>
>>>
>>>>Just to echo some comments I said in private, this driver is _not_
>>>>a replacement for drivers/ide. This is not, and has never been,
>>>>the intention. In fact, I need drivers/ide's continued existence,
>>>>so that I may have fewer boundaries on future development.
>>>
>>>
>>>Just out of interest, is there any _point_ to this driver? I can
>>>appreciate the approach, but I'd like to know if it does anything (at all)
>>>better than the native IDE driver? Faster? Anything?
>>
>>
>>Direction: SATA is much more suited to SCSI, because otherwise you wind
>>up re-creating all the queueing and error handling mess that SCSI
>>already does for you. The SATA2 host controllers coming out soon do
>>full host-side TCQ, not the dain-bramaged ATA TCQ bus-release stuff.
>>Doing SATA2 devel in drivers/ide will essentially be re-creating the
>>SCSI mid-layer.
>
>
> And now you are recreating ATA in SCSI ;-).
>
> Don't get me wrong: I like the idea very much, but why can't you
> share common code between drivers/ide and your ATA-SCSI?
SATA2 is really 100% different from PATA in all respects except for the
ATA commands themselves being sent down the pipe. Doing that inside
drivers/ide really means two driver cores at that point.
Sharing code is definitely an option, though!
Why do you think I named it "libata"? :)
Another SATA wrinkle: the phy layer.
SATA, like network cards, has an underlying connection layer. The host
connects to the device. Your link state may be up or down. This is
easily mapped to SCSI semantics (initiator connection to target). Much
more code if writing from scratch.
>>Legacy-free: Because I don't have to worry about legacy host
>>controllers, I can ignore limitations drivers/ide cannot. In
>>drivers/ide, each host IO (PIO/MMIO) is done via function pointer. If
>>your arch has a mach_vec, more function pointers. Mine does direct
>>calls to the asm/io.h functions instead. So, ATA command submission
>>is measurably faster.
>
>
> I think it is simply wrong; you should use function pointers.
> You can have, e.g., two PCI hosts, one using PIO and one using MMIO.
>
> "Measurably faster"? I doubt it.
> IO operations are REALLY slow when compared to CPU cycles.
You misunderstand.
drivers/ide -- uses function pointers HWIF->outb, etc. inside a static
taskfile-submit function. The TF submit functions statically order each
HWIF->outb() call, and you cannot change that order from the low-level
driver.
my driver -- uses separate taskfile-submit functions for different hardware.
My method directly uses outb(), writel(), or whatever is necessary. It
allows one to use __writel() and barriers to optimize as needed, or to
support special cache coherency needs. Further, for SATA and some
advanced PATA controllers, my method allows one to use a completely
alien and new taskfile submission method. For example, building SATA2
FIS's as an array of u32's, and then queueing that FIS onto a DMA ring.
The existing drivers/ide code is nowhere close to that.
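For instance, a submit path for that style of controller might look vaguely like this; every structure, constant and register here is invented for illustration, it does not describe any real SATA2 part:

#include <linux/types.h>
#include <asm/byteorder.h>
#include <asm/io.h>

/* Invented example: a host that consumes a register host-to-device FIS
 * as five little-endian 32-bit words placed on a 32-slot DMA ring. */
struct demo_fis_slot {
	u32 fis[5];		/* command FIS image */
	u32 prd_addr;		/* bus address of the scatter/gather table */
	u32 flags;
};

static void demo_build_fis(struct demo_fis_slot *slot, u8 cmd, u8 device,
			   u64 lba, u16 nsect)
{
	slot->fis[0] = cpu_to_le32(0x27 | (1 << 15) | ((u32) cmd << 16));
	slot->fis[1] = cpu_to_le32((u32) (lba & 0xffffff) | ((u32) device << 24));
	slot->fis[2] = cpu_to_le32((u32) (lba >> 24) & 0xffffff);
	slot->fis[3] = cpu_to_le32(nsect);
	slot->fis[4] = 0;
}

static void demo_queue_fis(struct demo_fis_slot *ring, void *tail_reg,
			   unsigned int *tail, const struct demo_fis_slot *cmd)
{
	ring[*tail] = *cmd;			/* copy into the DMA ring */
	*tail = (*tail + 1) & 31;		/* 32 slots, made up */
	writel(*tail, tail_reg);		/* kick the (fictional) HBA */
}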
Putting it into drivers/ide language, my driver adds a
HWIF->do_flagged_taskfile hook, and eliminates the HWIF->outb hooks.
Turns ~13 hook calls into one, from looking at flagged_taskfile.
And FWIW, I would not state "measurably faster" unless I actually
benchmarked it. :) It's a pittance in the grand scheme of things,
since you're inevitably limited by device speed, but lower CPU usage is
something nobody complains about...
>>sysfs: James and co are putting time into getting scsi sysfs right. I
>>would rather ride their coattails, and have my driver Just Work with
>>sysfs and the driver model.
>
>
> No big deal here, ATA will get it too.
Oh agreed. But scsi is pretty much there now (it's in testing in a BK
tree), and we're darn close to 2.6.0.
>>PIO data transfer is faster and more scheduler-friendly, since it polls
>>from a kernel thread.
>
>
> CPU polling is faster than IRQs?
Yes, almost always.
The problem with polling is that it chews up CPU if you're not careful,
starving out other processes. I'm careful :)
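The sort of care being talked about is just a polling loop that sleeps between samples instead of spinning; a sketch, where the port structure and status helper are made-up names:

#include <linux/sched.h>
#include <linux/types.h>

struct demo_port;				/* made-up driver type */
extern u8 demo_read_status(struct demo_port *port); /* made-up helper */

/* Poll for BSY to clear, yielding the CPU between samples so other
 * tasks are not starved while the device grinds through PIO. */
static int demo_poll_ready(struct demo_port *port, unsigned long timeout)
{
	unsigned long end = jiffies + timeout;

	while (time_before(jiffies, end)) {
		if (!(demo_read_status(port) & 0x80))	/* BSY clear */
			return 0;

		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule_timeout(1);		/* sleep roughly one tick */
	}
	return -1;				/* timed out */
}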
>>And for specifically Intel SATA, drivers/ide flat out doesn't work (even
>>though it claims to).
>
>
> So fix it ;-).
As you can see from the discussion, I think a new driver best serves the
future of SATA2 as well as today's SATA. So I'm not gonna bother... I
have a driver that works.
If you're interested in working the problem, the result of booting ICH5
SATA in drivers/ide is a hang at
hdX: attached to ide-disk driver
Nothing happens after that. sysrq doesn't work, so my assumption was
that the bus was locked up.
>>So, I conclude: faster, smaller, and better future direction. IMO, of
>>course :)
>
>
> And right now ugly and incomplete.
> IMO, of course ;-).
Incomplete? Kinda sorta. SATAPI isn't here yet, so all it needs is
more fine-grained error handling. I call it 100% stable now, though.
Ugly? Compared to drivers/ide? I actually have a lot of admiration and
respect for the PATA knowledge embedded in drivers/ide. But I would
never call it pretty :)
Jeff
On Mon, 26 May 2003, Jeff Garzik wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > On Mon, 26 May 2003, Jeff Garzik wrote:
> >
> >
> >>Linus Torvalds wrote:
> >>
> >>>On Mon, 26 May 2003, Jeff Garzik wrote:
> >>>
> >>>
> >>>>Just to echo some comments I said in private, this driver is _not_
> >>>>a replacement for drivers/ide. This is not, and has never been,
> >>>>the intention. In fact, I need drivers/ide's continued existence,
> >>>>so that I may have fewer boundaries on future development.
> >>>
> >>>
> >>>Just out of interest, is there any _point_ to this driver? I can
> >>>appreciate the approach, but I'd like to know if it does anything (at all)
> >>>better than the native IDE driver? Faster? Anything?
> >>
> >>
> >>Direction: SATA is much more suited to SCSI, because otherwise you wind
> >>up re-creating all the queueing and error handling mess that SCSI
> >>already does for you. The SATA2 host controllers coming out soon do
> >>full host-side TCQ, not the dain-bramaged ATA TCQ bus-release stuff.
> >>Doing SATA2 devel in drivers/ide will essentially be re-creating the
> >>SCSI mid-layer.
> >
> >
> > And now you are recreating ATA in SCSI ;-).
> >
> > Don't get me wrong: I like the idea very much, but why can't you
> > share common code between drivers/ide and your ATA-SCSI?
>
> SATA2 is really 100% different from PATA in all respects except for the
> ATA commands themselves being sent down the pipe. Doing that inside
> drivers/ide really means two driver cores at that point.
>
> Sharing code is definitely an option, though!
> Why do you think I named it "libata"? :)
>
>
> Another SATA wrinkle: the phy layer.
>
> SATA, like network cards, has an underlying connection layer. The host
> connects to the device. Your link state may be up or down. This is
> easily mapped to SCSI semantics (initiator connection to target). Much
> more code if writing from scratch.
>
>
> >>Legacy-free: Because I don't have to worry about legacy host
> >>controllers, I can ignore limitations drivers/ide cannot. In
> >>drivers/ide, each host IO (PIO/MMIO) is done via function pointer. If
> >>your arch has a mach_vec, more function pointers. Mine does direct
> >>calls to the asm/io.h functions instead. So, ATA command submission
> >>is measurably faster.
> >
> >
> > I think it is simply wrong; you should use function pointers.
> > You can have, e.g., two PCI hosts, one using PIO and one using MMIO.
> >
> > "Measurably faster"? I doubt it.
> > IO operations are REALLY slow when compared to CPU cycles.
>
> You misunderstand.
>
> drivers/ide -- uses function pointers HWIF->outb, etc. inside a static
> taskfile-submit function. The TF submit functions statically order each
> HWIF->outb() call, and you cannot change that order from the low-level
> driver.
>
> my driver -- uses separate taskfile-submit functions for different hardware.
>
> My method directly uses outb(), writel(), or whatever is necessary. It
> allows one to use __writel() and barriers to optimize as needed, or to
> support special cache coherency needs. Further, for SATA and some
> advanced PATA controllers, my method allows one to use a completely
> alien and new taskfile submission method. For example, building SATA2
> FIS's as an array of u32's, and then queueing that FIS onto a DMA ring.
> The existing drivers/ide code is nowhere close to that.
It's not that hard, but nevermind.
> Putting it into drivers/ide language, my driver adds a
> HWIF->do_flagged_taskfile hook, and eliminates the HWIF->outb hooks.
> Turns ~13 hook calls into one, from looking at flagged_taskfile.
Okay, I misunderstood.
This is cool, but you are using e.g. outb in more places than just taskfiles.
> And FWIW, I would not state "measurably faster" unless I actually
> benchmarked it. :) It's a pittance in the grand scheme of things,
> since you're inevitably limited by device speed, but lower CPU usage is
> something nobody complains about...
>
>
> >>sysfs: James and co are putting time into getting scsi sysfs right. I
> >>would rather ride their coattails, and have my driver Just Work with
> >>sysfs and the driver model.
> >
> >
> > No big deal here, ATA will get it too.
>
> Oh agreed. But scsi is pretty much there now (it's in testing in a BK
> tree), and we're darn close to 2.6.0.
>
>
> >>PIO data transfer is faster and more scheduler-friendly, since it polls
> >>from a kernel thread.
> >
> >
> > CPU polling is faster than IRQs?
>
> Yes, almost always.
>
> The problem with polling is that it chews up CPU if you're not careful,
> starving out other processes. I'm careful :)
Good.
> >>And for specifically Intel SATA, drivers/ide flat out doesn't work (even
> >>though it claims to).
> >
> >
> > So fix it ;-).
>
> As you can see from the discussion, I think a new driver best serves the
> future of SATA2 as well as today's SATA. So I'm not gonna bother... I
> have a driver that works.
>
> If you're interested in working the problem, the result of booting ICH5
> SATA in drivers/ide is a hang at
> hdX: attached to ide-disk driver
>
> Nothing happens after that. sysrq doesn't work, so my assumption was
> that the bus was locked up.
>
>
> >>So, I conclude: faster, smaller, and better future direction. IMO, of
> >>course :)
> >
> >
> > And right now ugly and incomplete.
> > IMO, of course ;-).
>
> Incomplete? Kinda sorta. SATAPI isn't here yet, so all it needs is
> more fine-grained error handling. I call it 100% stable now, though.
>
> Ugly? Compared to drivers/ide? I actually have a lot of admiration and
> respect for the PATA knowledge embedded in drivers/ide. But I would
> never call it pretty :)
Silly excuse ;-).
Thanks,
--
Bartlomiej
> Jeff
On Mon, 26 May 2003, Jeff Garzik wrote:
>
> Tangenting a bit:
Not at all. This is the thrust of _my_ worries.
> What does the block layer need, that it doesn't have now?
Exactly. I'd _love_ for people to really think about this.
Linus
On Mon, 26 May 2003, Jeff Garzik wrote:
> >>I think with SATA + drivers/ide, you reach a point of diminishing
> >>returns versus amount of time spent on mid-layer coding.
> >
> > I think that's a valid approach, and just have a special driver for SATA.
> > That's not the part I worry about.
>
> You're still at a point of diminishing returns for a native driver too.
No, because for a block-queue native driver there would be one huge
advantage: making sure the native queueing is brought up to be a fully
used interface.
Originally, native queueing was all there was, and all drivers interfaced
directly to it. The IDE and SCSI layers ended up being used largely
because they allowed drivers to use a global naming scheme, not because
they were "better". In fact, in many ways they were a lot worse than the
native queues, but then they had some other infrastructure that did help.
My bet is that most of that infrastructure today is just not worth it.
> Userland support is nonexistent, and users would have to start using yet
> another set of tools to deal with their disks, instead of the existing
> [scsi] ones.
That's not true. The command set is already exported, so that even things
like packet commands can be sent (that's how you write CD-RW under 2.5.x,
and it's the _generic_ layer that does all the command queueing).
That, btw, is how you should do the ATA transport thing. Even SCSI devices
can get the commands fed that way. It's neither ATA nor SCSI, it's a
"packet interface for the generic queue".
The only remaining piece of the puzzle (from a user land perspective) ends
up being the mapping from major/minor -> queue. That used to be a _huge_
issue, and the main reason you couldn't write a SCSI driver outside the
SCSI layer, for example (but look at DAC960 - it decided to do so
_anyway_, even if it meant being ostracised and put on a separate major).
But that should be much less of an issue now, since our queue mapping is a
lot more flexible these days.
> Device model, power management, hotplugging are all handled or
> in-the-works for scsi, and they are exactly what SATA needs.
Clue-in-time: if hotplugging doesn't work on a native level, it won't work
on a SCSI level either. So that _cannot_ be a real issue.
Device model stuff and power management all come from other areas, and
have little to do with SCSI.
Linus
On Mon, May 26 2003, Linus Torvalds wrote:
> > What does the block layer need, that it doesn't have now?
>
> Exactly. I'd _love_ for people to really think about this.
In discussion with Jeff, it seems most of what he wants is already
there. He just doesn't know it yet :-)
Maybe that's my problem as well, maybe the code / comments / doc /
whatever is not clear enough.
--
Jens Axboe
Linus Torvalds wrote:
> On Mon, 26 May 2003, Jeff Garzik wrote:
>
>>>>I think with SATA + drivers/ide, you reach a point of diminishing
>>>>returns versus amount of time spent on mid-layer coding.
>>>
>>>I think that's a valid approach, and just have a special driver for SATA.
>>>That's not the part I worry about.
>>
>>You're still at a point of diminishing returns for a native driver too.
>
>
> No, because for a block-queue native driver there would be one huge
> advantage: making sure the native queueing is brought up to be a fully
> used interface.
>
> Originally, native queueing was all there was, and all drivers interfaced
> directly to it. The IDE and SCSI layers ended up being used largely
> because they allowed drivers to use a global naming scheme, not because
> they were "better". In fact, in many ways they were a lot worse than the
> native queues, but then they had some other infrastructure that did help.
>
> My bet is that most of that infrastructure today is just not worth it.
[...]
> Clue-in-time: if hotplugging doesn't work on a native level, it won't work
> on a SCSI level either. So that _cannot_ be a real issue.
>
> Device model stuff and power management all come from other areas, and
> have little to do with SCSI.
Correct, but precisely by saying that, you're missing something.
The SCSI midlayer provides infrastructure I need -- which is not
specific to SCSI at all.
Device registration, host registration, major/minor sub-allocation,
device model, power management, command queueing features, command
timeout, error handling thread, userland command submission interface (a
la taskfile ioctl or SG_IO)... all these are possible via the generic
block layer -- but the generic block layer does not directly provide
these facilities. Each subsystem (ide, scsi, ...) has to provide bits
on top of that to make it usable. Each native block driver must
re-create these for its own subsystem.
All those "little bits" add up to code I have to write, for a native
driver, which I don't have to write for SCSI.
IF you are willing to sign off on a generic /dev/disk (bus agnostic),
then I am interested. That allows me to code up the above stuff ONCE,
and not see others do the same work when they write a native blk driver
(or avoid this work, by using SCSI like I do now). Otherwise, I'm
writing all those little bits YET AGAIN, duplicating stuff scsi already
does.
Taking that to a concrete level, that basically means more helpers at
the native block level.
Given that I have always agreed a native block driver is the best
eventual method, what's stopping me? Time:
If /dev/disk is a non-starter for 2.6, the best route is to take my SCSI
driver and evolve it over time to be a native block driver. Not start
with a native block driver. That gives us a SATA framework now, and
SCSI can disappear eventually.
#include <linux_is_usable_not_pie_in_sky_theory.h>
(re-arranged msg a bit)
>> Userland support is nonexistent, and users would have to start using yet
>> another set of tools to deal with their disks, instead of the existing
>> [scsi] ones.
> That's not true. The command set is already exported, so that even things
> like packet commands can be sent (that's how you write CD-RW under 2.5.x,
> and it's the _generic_ layer that does all the command queueing).
>
> That, btw, is how you should do the ATA transport thing. Even SCSI devices
> can get the commands fed that way. It's neither ATA nor SCSI, it's a
> "packet interface for the generic queue".
I agree the capabilities are there for a packet interface, but one still
has to define the userland interface. Yes, I know about SG_IO. It's
somewhat SCSI-specific, and sub-optimal. Think back to December 2001,
when you described a packet interface using read(2) and write(2).
Basically SG_IO, but better and without scsi specifics.
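For reference, this is roughly what driving a device through SG_IO looks like from userland; note that the payload is a raw SCSI CDB (an INQUIRY here), which is exactly the "somewhat SCSI-specific" part. The device path is just an example.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
	unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };  /* INQUIRY */
	unsigned char buf[96], sense[32];
	struct sg_io_hdr hdr;
	int fd = open("/dev/sg0", O_RDONLY);	/* example device */

	if (fd < 0)
		return 1;

	memset(&hdr, 0, sizeof(hdr));
	hdr.interface_id = 'S';
	hdr.cmdp = cdb;
	hdr.cmd_len = sizeof(cdb);
	hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	hdr.dxferp = buf;
	hdr.dxfer_len = sizeof(buf);
	hdr.sbp = sense;
	hdr.mx_sb_len = sizeof(sense);
	hdr.timeout = 5000;			/* milliseconds */

	if (ioctl(fd, SG_IO, &hdr) < 0)
		return 1;

	printf("vendor: %.8s model: %.16s\n", buf + 8, buf + 16);
	return 0;
}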
> The only remaining piece of the puzzle (from a user land perspective) ends
> up being the mapping from major/minor -> queue. That used to be a _huge_
> issue, and the main reason you couldn't write a SCSI driver outside the
> SCSI layer, for example (but look at DAC960 - it decided to do so
> _anyway_, even if it meant being ostracised and put on a separate major).
>
> But that should be much less of an issue now, since our queue mapping is a
> lot more flexible these days.
The mapping part is easy, IMO. I would just have the "generic
infrastructure" export chrdrvs for each queue I manage, using the
interface you described in December 2001. sysfs can then describe the
{major+minor | CIDR | whatever}-to-queue mapping.
For example, when /dev/disk device registration occurs at runtime, the
/dev/disk infrastructure would make a new chrdrv appear.
Anyway, stepping back to the larger issue of "userland tools", I'm not
just talking about a taskfile I/O interface. I'm talking about all the
little distro tools for IDE and SCSI, like /usr/bin/scsi_unique_id or
/sbin/hdparm. And the assumptions that distro installers make.
Creating yet another set of these tools and standard practices is just
not worth it, without bus-agnostic /dev/disk. Sysadmins are going to be
annoyed at yet another interface and yet another not-IDE, not-SCSI (and
thus second-class) block major.
It's not worth it for me, with working code now, and it's not worth it for
the next person who wants to add a bus type to Linux.
Unless I am to be creating infrastructure which non-SATA people can use,
your suggestion equates to making the SATA driver a second-class
citizen, with lots of additional coding for little gain, and with lots
of pain on the distro side that must be repeated again and again for
each native block driver.
Jeff
Jens Axboe wrote:
> On Mon, May 26 2003, Linus Torvalds wrote:
>
>>>What does the block layer need, that it doesn't have now?
>>
>>Exactly. I'd _love_ for people to really think about this.
>
>
> In discussion with Jeff, it seems most of what he wants is already
> there. He just doesn't know it yet :-)
hehe. Here's a salient point:
native block drivers are typically used in one of two ways: creating a
whole subsystem (ide, scsi), or servicing a single host (dac960).
I am doing the first: creating a whole subsystem. The infrastructure
involved in that, over and above what block already provides, is that
part I dread coding.
If I am to code that, I want to do so ONCE. And that means a
bus-agnostic /dev/{disk,cdrom,tape}. Not /dev/{ad,acd,at,ag}.
Jeff
Jens Axboe wrote:
> On Mon, May 26 2003, Linus Torvalds wrote:
>
>>>What does the block layer need, that it doesn't have now?
>>
>>Exactly. I'd _love_ for people to really think about this.
>
>
> In discussion with Jeff, it seems most of what he wants is already
> there. He just doesn't know it yet :-)
Another important point is time.
I continue to agree that a native block driver is the best direction.
But with 2.6.0 looming, I think it's best to evolve my ATA driver to be
a native block driver from a scsi one. Not start out as a native
driver. That's significant pre-2.6 churn.
Or, it lives out-of-tree until 2.7 and people with SATA hardware have to
go out-of-tree for their driver for months and months, until the working
driver is deemed sufficiently native :) In the meantime, distros
wanting working SATA will just ship the SCSI driver as-is. :(
Jeff
On Mon, May 26 2003, Jeff Garzik wrote:
> Jens Axboe wrote:
> >On Mon, May 26 2003, Linus Torvalds wrote:
> >
> >>>What does the block layer need, that it doesn't have now?
> >>
> >>Exactly. I'd _love_ for people to really think about this.
> >
> >
> >In discussion with Jeff, it seems most of what he wants is already
> >there. He just doesn't know it yet :-)
>
>
> Another important point is time.
>
> I continue to agree that a native block driver is the best direction.
>
> But with 2.6.0 looming, I think it's best to evolve my ATA driver to be
> a native block driver from a scsi one. Not start out as a native
> driver. That's significant pre-2.6 churn.
I don't think that makes any sense. If you really do find missing
functionality that is a candidate to become a generic block layer
property, we can add it.
> Or, it lives out-of-tree until 2.7 and people with SATA hardware have to
> go out-of-tree for their driver for months and months, until the working
> driver is deemed sufficiently native :) In the meantime, distros
> wanting working SATA will just ship the SCSI driver as-is. :(
I don't know why you are even worrying about this yet, time will decide
what happens. As it stands right now, I still consider your driver to be
something to play with, not something that should go into the kernel
anytime soon.
Maybe in 6 months time it has evolved to be something really great, we
add it then. Right now you are still in the design stages in some areas.
--
Jens Axboe
On Mon, May 26 2003, Jens Axboe wrote:
> On Mon, May 26 2003, Linus Torvalds wrote:
> > > What does the block layer need, that it doesn't have now?
> >
> > Exactly. I'd _love_ for people to really think about this.
>
> In discussion with Jeff, it seems most of what he wants is already
> there. He just doesn't know it yet :-)
>
> Maybe that's my problem as well, maybe the code / comments / doc /
> whatever is not clear enough.
My wishlist for this would be:
1. Unified SG segment allocation. The SCSI layer currently has a
mempool implementation to cope with this, is there a reason it can't
become block generic?
2. Device locality awareness. Quite a bit of the esoteric SCSI queueing
code occurs because we have two types of queue events:
a. device can't accept another command---stop queue and restart when the
device sends a completion back
b. the host adapter is out of resources for *all* its devices. Block
all device queues until we free some resources (again, usually a
returning command).
3. Perhaps some type of unified command handling. At the moment, we all
seem to allocate DMA'able regions for our commands/taskfiles/whatever
and attach them to request->special. Then we need to release them again
before completing the request.
4. Same thing goes for sense buffers.
5. There needs to be some amalgam of the SCSI code for dynamic tag
command queue depth handling.
OK, I'll stop now.
James
On Mon, May 26 2003, James Bottomley wrote:
>
> On Mon, May 26 2003, Jens Axboe wrote:
> > On Mon, May 26 2003, Linus Torvalds wrote:
> > > > What does the block layer need, that it doesn't have now?
> > >
> > > Exactly. I'd _love_ for people to really think about this.
> >
> > In discussion with Jeff, it seems most of what he wants is already
> > there. He just doesn't know it yet :-)
> >
> > Maybe that's my problem as well, maybe the code / comments / doc /
> > whatever is not clear enough.
>
> My wishlist for this would be:
>
> 1. Unified SG segment allocation. The SCSI layer currently has a
> mempool implementation to cope with this, is there a reason it can't
> become block generic?
Of course that is doable; when I killed scsi_dma.c it was just a direct
replacement. Given that IDE had no such dynamic sg list allocation
requirements, it stayed in SCSI. Overdesign is never good :)
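For reference, the shape of the thing being discussed is small; a hedged sketch of a mempool-backed sg table pool, written against the 2.5-era mempool and slab APIs as I recall them (the names and the fixed 64-entry table size are invented):

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/mempool.h>
#include <linux/errno.h>
#include <asm/scatterlist.h>

#define DEMO_SG_SEGMENTS	64	/* invented fixed table size */

static kmem_cache_t *demo_sg_cache;
static mempool_t *demo_sg_pool;

static void *demo_pool_alloc(int gfp_mask, void *data)
{
	return kmem_cache_alloc((kmem_cache_t *) data, gfp_mask);
}

static void demo_pool_free(void *element, void *data)
{
	kmem_cache_free((kmem_cache_t *) data, element);
}

static int __init demo_sg_pool_init(void)
{
	demo_sg_cache = kmem_cache_create("demo_sgtable",
			DEMO_SG_SEGMENTS * sizeof(struct scatterlist),
			0, 0, NULL, NULL);
	if (!demo_sg_cache)
		return -ENOMEM;

	/* keep a few tables in reserve so allocation can always make
	 * forward progress under memory pressure */
	demo_sg_pool = mempool_create(8, demo_pool_alloc, demo_pool_free,
				      demo_sg_cache);
	if (!demo_sg_pool) {
		kmem_cache_destroy(demo_sg_cache);
		return -ENOMEM;
	}
	return 0;
}

static struct scatterlist *demo_sg_alloc(int gfp_mask)
{
	return mempool_alloc(demo_sg_pool, gfp_mask);
}

static void demo_sg_free(struct scatterlist *sg)
{
	mempool_free(sg, demo_sg_pool);
}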
> 2. Device locality awareness. Quite a bit of the esoteric SCSI queueing
> code occurs because we have two types of queue events:
> a. device can't accept another command---stop queue and restart when the
> device sends a completion back
This should be doable.
> b. the host adapter is out of resources for *all* its devices. Block
> all device queues until we free some resources (again, usually a
> returning command).
This is harder, because it involves more than one specific queue.
> 3. Perhaps some type of unified command handling. At the moment, we all
> seem to allocate DMA'able regions for our commands/taskfiles/whatever
> and attach them to request->special. Then we need to release them again
> before completing the request.
>
> 4. Same thing goes for sense buffers.
Completely agree.
> 5. There needs to be some amalgam of the SCSI code for dynamic tag
> command queue depth handling.
Again, block layer queueing was designed for what I needed (ide tcq) and
no overdesign was attempted. If you describe what you need, I'd be very
happy to oblige and add those bits. Some decent depth change handling, I
presume?
--
Jens Axboe
On Mon, 2003-05-26 at 14:18, Jens Axboe wrote:
> > 1. Unified SG segment allocation. The SCSI layer currently has a
> > mempool implementation to cope with this, is there a reason it can't
> > become block generic?
>
> Of course that is doable, when I killed scsi_dma.c it was just a direct
> replacement. Given that IDE had no such dynamic sg list allocation
> requirements, it stayed in SCSI. Overdesign is never good :)
I agree with the sentiment. I just don't think variable size SG tables
will remain the exclusive province of SCSI forever.
> > b. the host adapter is out of resources for *all* its devices. Block
> > all device queues until we free some resources (again, usually a
> > returning command).
>
> This is harder, because it involves more than one specific queue.
Yes, this is our nasty case, especially for locking and ref
counting...you didn't say I only had to hand off the easy problems,
though...
Hotplugging has to have some awareness of this locality too. Even for
IDE, hot unplug a card and you can lose two devices per cable.
> > 5. There needs to be some amalgam of the SCSI code for dynamic tag
> > command queue depth handling.
>
> Again, block layer queueing was designed for what I needed (ide tcq) and
> no overdesign was attempted. If you describe what you need, I'd be very
> happy to oblige and add those bits. Some decent depth change handling, I
> presume?
Pretty much yes, now. We lost all of our memory allocation nightmare
problems when we moved away from fixed command queues per device to lazy
command allocation using slabs.
James
On Mon, May 26 2003, James Bottomley wrote:
> On Mon, 2003-05-26 at 14:18, Jens Axboe wrote:
> > > 1. Unified SG segment allocation. The SCSI layer currently has a
> > > mempool implementation to cope with this, is there a reason it can't
> > > become block generic?
> >
> > Of course that is doable, when I killed scsi_dma.c it was just a direct
> > replacement. Given that IDE had no such dynamic sg list allocation
> > requirements, it stayed in SCSI. Overdesign is never good :)
>
> I agree with the sentiment. I just don't think variable size SG tables
> will remain the exclusive province of SCSI forever.
Agreed; until that becomes a problem I don't see a reason to rip it out
of SCSI.
> > > b. the host adapter is out of resources for *all* its devices. Block
> > > all device queues until we free some resources (again, usually a
> > > returning command).
> >
> > This is harder, because it involves more than one specific queue.
>
> Yes, this is our nasty case, especially for locking and ref
> counting...you didn't say I only had to hand off the easy problems,
> though...
:)
I really think this should be handled in the SCSI layer. You are dealing
with devices hanging off a hardware unit of some sort.
It's also possible to go too far with this whole abstraction deal. The
resulting code ends up being contorted, instead of having to separate
'straight forward' cases in SCSI and XYZ (for instance).
> Hotplugging has to have some awareness of this locality too. Even for
> IDE, hot unplug a card and you can lose two devices per cable.
Hmmm yes. Now we are moving into general device management, though.
> > > 5. There needs to be some amalgam of the SCSI code for dynamic tag
> > > command queue depth handling.
> >
> > Again, block layer queueing was designed for what I needed (ide tcq) and
> > no overdesign was attempted. If you describe what you need, I'd be very
> > happy to oblige and add those bits. Some decent depth change handling, I
> > presume?
>
> Pretty much yes, now. We lost all of our memory allocation nightmare
> problems when we moved away from fixed command queues per device to lazy
> command allocation using slabs.
Alright, so what do you need? Start out with X tags, shrink to Y (based
on repeated queue full conditions)? Anything else?
--
Jens Axboe
On Mon, 2003-05-26 at 15:07, Jens Axboe wrote:
> Alright, so what do you need? Start out with X tags, shrink to Y (based
> on repeated queue full conditions)? Anything else?
Actually, it's easier than that: just an API to alter the number of tags
in the block layer (really only the size of your internal hash table).
The actual heuristics of when to alter the queue depth is the province
of the individual drivers (although Doug Ledford was going to come up
with a generic implementation).
James
On Mon, May 26 2003, James Bottomley wrote:
> On Mon, 2003-05-26 at 15:07, Jens Axboe wrote:
> > Alright, so what do you need? Start out with X tags, shrink to Y (based
> > on repeated queue full conditions)? Anything else?
>
> Actually, it's easier than that: just an API to alter the number of tags
> in the block layer (really only the size of your internal hash table).
> The actual heuristics of when to alter the queue depth is the province
> of the individual drivers (although Doug Ledford was going to come up
> with a generic implementation).
That's actually what I meant, that the SCSI layer would call down into
the block layer to set the size. I don't want to know about queue full
conditions.
The internal memory requirements for the queue table are small (a bit per
tag), so I think we can basically get away with just decrementing
->max_depth.
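In code terms, something about this small; a sketch against the 2.5 block tag structures (the helper itself is hypothetical, not an existing block layer export, and it only handles shrinking):

#include <linux/blkdev.h>
#include <linux/errno.h>

/* Hypothetical helper: clamp the number of tags the block layer will
 * hand out for this queue.  The driver decides when to call it; the
 * tag map stays allocated, we simply stop using the high tag numbers. */
static int demo_blk_shrink_tag_depth(request_queue_t *q, int new_depth)
{
	struct blk_queue_tag *bqt = q->queue_tags;

	if (!bqt || new_depth < 1 || new_depth > bqt->max_depth)
		return -EINVAL;

	bqt->max_depth = new_depth;
	return 0;
}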
--
Jens Axboe
On Mon, 26 May 2003, Jeff Garzik wrote:
>
> Correct, but precisely by saying that, you're missing something.
You're missing _my_ point.
> The SCSI midlayer provides infrastructure I need -- which is not
> specific to SCSI at all.
If it isn't specific to SCSI, then it sure as hell shouldn't BE THERE!
My point is that it's _wrong_ to make non-SCSI drivers use the SCSI layer,
because that shows that something is misdesigned.
And I bet there isn't all that much left that really helps.
You adding more "pseudo-SCSI" crap just _makes_things_worse. It does not
advance anything, it regresses.
Linus
On 26 May 2003, James Bottomley wrote:
>
> On Mon, 2003-05-26 at 15:07, Jens Axboe wrote:
> > Alright, so what do you need? Start out with X tags, shrink to Y (based
> > on repeated queue full conditions)? Anything else?
>
> Actually, it's easier than that: just an API to alter the number of tags
> in the block layer (really only the size of your internal hash table).
> The actual heuristics of when to alter the queue depth is the province
> of the individual drivers (although Doug Ledford was going to come up
> with a generic implementation).
Talking about tagged queueing - does the SCSI layer still remove the
request from the request list when it starts executing it?
At least historically that's a major mistake, and generates a crappy
elevator, because it removes information from the block layer about where
the disk is (or is going to be).
I know Andrew thinks that SCSI tagged queuing is a bunch of crap, and he
has the latency numbers to prove it. He blames the SCSI disks themselves,
but I think it might be the fact that SCSI makes it impossible to make a
fair queuing algorithm for higher levels by hiding information.
Has anybody looked at just removing the request at command _completion_
time instead? That's what IDE does, and it's the _right_ thing to do.
I'd hate for SATA to pick up these kinds of mistakes from the SCSI layer.
Linus
On Mon, 2003-05-26 at 16:27, Linus Torvalds wrote:
> Talking about tagged queueing - does the SCSI layer still remove the
> request from the request list when it starts executing it?
That's a block layer requirement (blk_queue_start_tag does the dequeue).
But yes, for untagged requests, we still dequeue them manually.
> At least historically that's a major mistake, and generates a crappy
> elevator, because it removes information from the block layer about where
> the disk is (or is going to be).
>
> I know Andrew thinks that SCSI tagged queuing is a bunch of crap, and he
> has the latency numbers to prove it. He blames the SCSI disks themselves,
> but I think it might be the fact that SCSI makes it impossible to make a
> fair queuing algorithm for higher levels by hiding information.
>
> Has anybody looked at just removing the request at command _completion_
> time instead? That's what IDE does, and it's the _right_ thing to do.
Well...I could do it, but since it only applies to untagged devices
(which is really tapes and some CD-ROMs nowadays), I'm not sure it would
be worth the effort.
> I'd hate for SATA to pick up these kinds of mistakes from the SCSI layer.
The elevator is based on linear head movements. I'm not sure it's
optimal to figure out a way to leave all the executing tagged requests
in the queue, is it? They can be spread out all over the disc with no
idea which one the disc is currently executing.
James
On Mon, May 26 2003, Linus Torvalds wrote:
>
> On 26 May 2003, James Bottomley wrote:
> >
> > On Mon, 2003-05-26 at 15:07, Jens Axboe wrote:
> > > Alright, so what do you need? Start out with X tags, shrink to Y (based
> > > on repeated queue full conditions)? Anything else?
> >
> > Actually, it's easier than that: just an API to alter the number of tags
> > in the block layer (really only the size of your internal hash table).
> > The actual heuristics of when to alter the queue depth is the province
> > of the individual drivers (although Doug Ledford was going to come up
> > with a generic implementation).
>
> Talking about tagged queueing - does the SCSI layer still remove the
> request from the request list when it starts executing it?
Yes
> At least historically that's a major mistake, and generates a crappy
> elevator, because it removes information from the block layer about where
> the disk is (or is going to be).
Not necessarily; the io schedulers keep track of where we are going. You
don't need an active front request for that.
> I know Andrew thinks that SCSI tagged queuing is a bunch of crap, and he
> has the latency numbers to prove it. He blames the SCSI disks themselves,
> but I think it might be the fact that SCSI makes it impossible to make a
> fair queuing algorithm for higher levels by hiding information.
>
> Has anybody looked at just removing the request at command _completion_
> time instead? That's what IDE does, and it's the _right_ thing to do.
I know this is a pet peeve of yours (must be, I remember you bringing it
up at least 3 times before :), but I don't think that's necessarily true.
It shouldn't matter _one_ bit whether you leave the request there or
not, it's unmergeable. As long as the io scheduler keeps track of this
(and it does!) we are golden.
--
Jens Axboe
On 26 May 2003, James Bottomley wrote:
>
> > I'd hate for SATA to pick up these kinds of mistakes from the SCSI layer.
>
> The elevator is based on linear head movements.
Historically, yes.
But we've been moving more and more towards a latency-based elevator, or
at least one that takes latency into account. Which is exactly why you'd
like to leave unfinished requests on the queue, because otherwise your
queue doesn't really show what is really going on.
In particular, while the higher layers don't actually _do_ this yet, from
a latency standpoint the right thing to do when some request seems to get
starved is to refuse to feed more tagged requests to the device until the
starved request has finished.
As I mentioned, Andrew actually had some really bad latency numbers to
prove that this is a real issue. SCSI with more than 4 tags or so results
in potentially _major_ starvation, because the disks themselves are not
even trying to avoid it.
Also, even aside from the starvation issue with unfair disks, just from a
"linear head movement" standpoint, you really want to sort the queue
according to what is going on _now_ in the disk. If the disk eats the
requests immediately (but doesn't execute them immediately), the sorting
has nothing to go on, and you get tons of back-and-forth movements.
Basically, if you're trying to do an elevator, you need to know what the
outstanding commands are. Think it through on paper, and you'll see what I
mean.
Linus
On Mon, 26 May 2003, Jens Axboe wrote:
>
> I know this is a pet peeve of yours (must be, I remember you bringing it
> up at least 3 times before :), but I don't think that's necessarily true.
> It shouldn't matter _one_ bit whether you leave the request there or
> not, it's unmergeable.
It's not the merging that I worry about. It's latency and starvation.
Think of it this way: if you keep feeding a disk requests, and the disk
always tries to do the closest one (which is a likely algorithm), you can
easily have a situation where the disk _never_ actually schedules a
request that is at one "end" of the platter.
Think of all the fairness issues we've had in the elevator code, and
realize that the low-level disk probably implements _none_ of those
fairness algorithms.
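A toy user-space illustration of the effect: a drive that always services the nearest queued sector, while the host keeps refilling the queue with requests near the head, never gets around to the one request parked at the far end of the platter. (Entirely synthetic numbers, of course.)

#include <stdio.h>
#include <stdlib.h>

#define QDEPTH 8

int main(void)
{
	long q[QDEPTH], head = 0;
	int i, served;

	for (i = 0; i < QDEPTH - 1; i++)
		q[i] = rand() % 1000;		/* hot area near sector 0 */
	q[QDEPTH - 1] = 1000000;		/* one far-away request */

	for (served = 0; served < 100000; served++) {
		int best = 0;

		/* the "drive": always pick the request nearest the head */
		for (i = 1; i < QDEPTH; i++)
			if (labs(q[i] - head) < labs(q[best] - head))
				best = i;

		if (q[best] == 1000000) {
			printf("far request served after %d I/Os\n", served);
			return 0;
		}

		head = q[best];
		q[best] = rand() % 1000;	/* host refills near the head */
	}
	printf("far request still starved after %d I/Os\n", served);
	return 0;
}

Run as-is, the far request never gets picked: there is always something within a thousand sectors of the head, so only the "still starved" line prints.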
> As long as the io scheduler keeps track of this (and it does!) we are
> golden.
Hmm.. Where does it keep track of request latency for requests that have
been removed from the queue?
Linus
On Mon, May 26 2003, Linus Torvalds wrote:
>
> On 26 May 2003, James Bottomley wrote:
> >
> > > I'd hate for SATA to pick up these kinds of mistakes from the SCSI layer.
> >
> > The elevator is based on linear head movements.
>
> Historically, yes.
>
> But we've been moving more and more towards a latency-based elevator, or
> at least one that takes latency into account. Which is exactly why you'd
> like to leave unfinished requests on the queue, because otherwise your
> queue doesn't really show what is really going on.
The newer io schedulers divide the queue into a front and backend, so
it's not exactly trivial to just 'leave it on the queue'. Which queue?
You know which ones are pending; they are on the busy queue. You can
look there.
> In particular, while the higher layers don't actually _do_ this yet, from
> a latency standpoint the right thing to do when some request seems to get
> starved is to refuse to feed more tagged requests to the device until the
> starved request has finished.
Agree
> As I mentioned, Andrew actually had some really bad latency numbers to
> prove that this is a real issue. SCSI with more than 4 tags or so results
> in potentially _major_ starvation, because the disks themselves are not
> even trying to avoid it.
Basically everyone agrees that this shouldn't happen in newer disks; I'm
inclined to think we are seeing internal queueing bugs in the SCSI
drivers or SCSI layer itself.
> Also, even aside from the starvation issue with unfair disks, just from a
> "linear head movement" standpoint, you really want to sort the queue
> according to what is going on _now_ in the disk. If the disk eats the
> requests immediately (but doesn't execute them immediately), the sorting
> has nothing to go on, and you get tons of back-and-forth movements.
>
> Basically, if you're trying to do an elevator, you need to know what the
> outstanding commands are. Think it through on paper, and you'll see what I
> mean.
Who said we are using an elevator? News flash, we haven't used an
elevator design for a long time.
--
Jens Axboe
On Mon, 2003-05-26 at 16:45, Linus Torvalds wrote:
>
> On 26 May 2003, James Bottomley wrote:
> >
> > > I'd hate for SATA to pick up these kinds of mistakes from the SCSI layer.
> >
> > The elevator is based on linear head movements.
>
> Historically, yes.
>
> But we've been moving more and more towards a latency-based elevator, or
> at least one that takes latency into account. Which is exactly why you'd
> like to leave unfinished requests on the queue, because otherwise your
> queue doesn't really show what is really going on.
>
> In particular, while the higher layers don't actually _do_ this yet, from
> a latency standpoint the right thing to do when some request seems to get
> starved is to refuse to feed more tagged requests to the device until the
> starved request has finished.
OK, number seven on the list was going to be thinking about tracking
timeouts at the block layer.
Tag starvation handling is a property of each individual SCSI HBA (and I
know it shouldn't be). Some do a good job, some don't. However, it's
implicitly also handled in the SCSI error handler because on timeout we
quiesce the device (which finally causes the starved tag to execute). I
have thought about doing it properly in the error handler, but it's a
radical divergence---you basically need to begin to throttle the queue,
but wait to see if your starved tag gets serviced and then increase the
feed again (as opposed to our current quiesce then handle model).
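Very roughly, the throttle-and-watch model would look like this
(illustrative sketch only; none of these names exist in the SCSI layer):

#include <stdio.h>

enum feed_state { FEED_NORMAL, FEED_THROTTLED };

struct toy_dev {
	enum feed_state state;
	int allowed_tags;	/* new tags we will still issue */
	int normal_tags;
};

static void starvation_suspected(struct toy_dev *dev)
{
	dev->state = FEED_THROTTLED;
	dev->allowed_tags = 0;	/* stop feeding past the starved command */
}

static void starved_tag_completed(struct toy_dev *dev)
{
	dev->state = FEED_NORMAL;
	dev->allowed_tags = dev->normal_tags;	/* open the feed back up */
}

int main(void)
{
	struct toy_dev dev = { FEED_NORMAL, 32, 32 };

	starvation_suspected(&dev);
	printf("throttled: %d new tags allowed\n", dev.allowed_tags);
	starved_tag_completed(&dev);
	printf("recovered: %d new tags allowed\n", dev.allowed_tags);
	return 0;
}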
> As I mentioned, Andrew actually had some really bad latency numbers to
> prove that this is a real issue. SCSI with more than 4 tags or so results
> in potentially _major_ starvation, because the disks themselves are not
> even trying to avoid it.
Tag starvation isn't really a problem for the majority of modern SCSI
drives (big iron vendors yelled and kicked about this a while ago). You
can still find some of the older drives that do have these issues (I
keep several for the purposes of exciting the error handling).
> Also, even aside from the starvation issue with unfair disks, just from a
> "linear head movement" standpoint, you really want to sort the queue
> according to what is going on _now_ in the disk. If the disk eats the
> requests immediately (but doesn't execute them immediately), the sorting
> has nothing to go on, and you get tons of back-and-forth movements.
Most scsi discs have a non-linear geometry anyway, so "head scheduling"
is pretty useless to them. All SCSI really wants is I/O consolidation,
i.e. we'd like to use fewer, larger requests.
James
On Mon, May 26 2003, Linus Torvalds wrote:
>
> On Mon, 26 May 2003, Jens Axboe wrote:
> >
> > I know this is a pet peeve of yours (must be, I remember you bringing it
> > up at least 3 times before :), but I don't think that's necessarily true.
> > It shouldn't matter _one_ bit whether you leave the request there or
> > not, it's unmergeable.
>
> It's not the merging that I worry about. It's latency and starvation.
>
> Think of it this way: if you keep feeding a disk requests, and the disk
> always tries to do the closest one (which is a likely algorithm), you can
> easily have a situation where the disk _never_ actually schedules a
> request that is at one "end" of the platter.
Then you have a bad disk, period. If the disk always tries to
approximate SPTF internally, then it's a bad design. Apparently that
Other OS times read/write requests out after 3 seconds, so we at least
know we are getting service in that time frame. Not saying that is good
enough, just a data point.
But the situation you describe above can easily be fixed; you described
the solution yourself in the previous mail. The silly tag depths are a
problem in themselves and should not be used. Keep a sane number of
tags, just enough to keep the disk busy, and use the "don't queue more
requests before X finishes, because X has been waiting for Y ms" tactic.
In fact, considering folks want to make error handling for command
timeouts a block property (which I agree with; we are already going
there with the SG_IO stuff), we can soft-timeout a command if need be
and handle the case from there. What do you think?
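The check itself is tiny, something along these lines (made-up names
here; the real version of this test lives in the ll_rw_blk.c patch
further down the thread):

#include <stdbool.h>

#define STARVE_LIMIT_MS	100

struct toy_tagged_rq {
	unsigned long issued_ms;	/* when this tag was started */
};

static bool toy_may_start_tag(const struct toy_tagged_rq *oldest,
			      unsigned long now_ms)
{
	/* hold back new tags while the oldest one is being starved */
	if (oldest && now_ms - oldest->issued_ms > STARVE_LIMIT_MS)
		return false;
	return true;
}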
> Think of all the fairness issues we've had in the elevator code, and
> realize that the low-level disk probably implements _none_ of those
> fairness algorithms.
I think it does, to some extent at least.
> > As long as the io scheduler keeps track of this (and it does!) we are
> > golden.
>
> Hmm.. Where does it keep track of request latency for requests that have
> been removed from the queue?
Well, it doesn't...
--
Jens Axboe
On Mon, 26 May 2003, Jens Axboe wrote:
>
> > Think of all the fairness issues we've had in the elevator code, and
> > realize that the low-level disk probably implements _none_ of those
> > fairness algorithms.
>
> I think it does, to some extent at least.
I doubt they do a very good job of it. I know of bad cases, even with
"high-end" hardware. Sure, we can hope that it's getting better, but do we
want to bet on it?
> > Hmm.. Where does it keep track of request latency for requests that have
> > been removed from the queue?
>
> Well, it doesn't...
Yeah. Which means that right now _really_ long starvation will show up as
timeouts, while other cases will just show up as bad latency.
Which will _work_, of course (assuming the timeout handling is correct,
which is a big if in itself), but it still sucks from a usability
standpoint.
Even if we drop our timeouts from 30 seconds (or whatever they are now)
down to just a few seconds, that's a _loooong_ time, and we should be a
lot more proactive about things. Audio/video stuff tends to want things
with latencies in the tenth-of-a-second range, even when they buffer
things up internally to hide the worst cases.
Linus
Linus Torvalds wrote:
>On Mon, 26 May 2003, Jens Axboe wrote:
>
>>>Think of all the fairness issues we've had in the elevator code, and
>>>realize that the low-level disk probably implements _none_ of those
>>>fairness algorithms.
>>>
>>I think it does, to some extent at least.
>>
>
>I doubt they do a very good job of it. I know of bad cases, even with
>"high-end" hardware. Sure, we can hope that it's getting better, but do we
>want to bet on it?
>
>
>>>Hmm.. Where does it keep track of request latency for requests that have
>>>been removed from the queue?
>>>
>>Well, it doesn't...
>>
>
>Yeah. Which means that right now _really_ long starvation will show up as
>timeouts, while other cases will just show up as bad latency.
>
There is an elevator notifier which is called on request
completion in Andrew's tree (needed for AS io scheduler). This
can be used to do what you want.
On Tue, 27 May 2003, Nick Piggin wrote:
>
> There is an elevator notifier which is called on request
> completion in Andrew's tree (needed for AS io scheduler). This
> can be used to do what you want.
Well, yeah, sure, you can use it to keep track of outstanding requests.
But wouldn't it be nicer to see them in the first place?
The question being: how do you sanely avoid adding more requests to the
queue if you start seeing starvation? Or if adding more requests, at least
using an ordered tag or a barrier or whatever to make sure that you tell
the disk that the new request must not be done before the old one has
finally been satisfied?
I think you'd want to see the old requests in order to be able to make
that decision reasonably well.
This is clearly not a 2.6.x issue, btw.
Linus
Linus Torvalds wrote:
>On Tue, 27 May 2003, Nick Piggin wrote:
>
>>There is an elevator notifier which is called on request
>>completion in Andrew's tree (needed for AS io scheduler). This
>>can be used to do what you want.
>>
>
>Well, yeah, sure, you can use it to keep track of outstanding requests.
>But wouldn't it be nicer to see them in the first place?
>
Yeah. Basically the driver will call:
elv_next_request, elv_remove_request, and elv_completed_request.
elv_remove_request can easily just move the request to another
list, from which elv_completed_request later removes it.
Don't let the names bother you, the elevator (in Andrew's tree)
gets all the information it needs.
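Schematically (invented types, not the real elevator API), it amounts
to:

/* "remove" only moves the request to a dispatched list, and "completed"
 * is where it finally disappears, so the io scheduler can see it the
 * whole time. */
struct toy_request {
	struct toy_request *next;
};

struct toy_elevator {
	struct toy_request *pending;	/* not yet handed to the driver */
	struct toy_request *dispatched;	/* handed out, not yet completed */
};

static void toy_elv_remove_request(struct toy_elevator *e,
				   struct toy_request *rq)
{
	/* (unlinking from e->pending omitted) ...then keep it visible: */
	rq->next = e->dispatched;
	e->dispatched = rq;
}

static void toy_elv_completed_request(struct toy_elevator *e,
				      struct toy_request *rq)
{
	/* only now does the scheduler forget about the request */
	for (struct toy_request **p = &e->dispatched; *p; p = &(*p)->next) {
		if (*p == rq) {
			*p = rq->next;
			break;
		}
	}
}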
On Llu, 2003-05-26 at 22:34, Linus Torvalds wrote:
> Even if we drop our timeouts from 30 seconds (or whatever they are now)
> down to just a few seconds, that's a _loooong_ time, and we should be a
> lot more proactive about things. Audio/video stuff tends to want things
> with latencies in the tenth-of-a-second range, even when they buffer
> things up internally to hide the worst cases.
Many IDE drives will take several seconds to give up on a failing I/O
because they do all the recovery themselves. IDE CD/DVD can be
especially bad for this.
On Llu, 2003-05-26 at 06:40, Linus Torvalds wrote:
> > And for specifically Intel SATA, drivers/ide flat out doesn't work (even
> > though it claims to).
>
> Well, I don't think it claimed to, until today. Still doesn't work?
Even if it did, it would at best be a toy. The core IDE layer doesn't
handle SCSI errors properly (needed for ATAPI) except via ide-scsi. It
doesn't handle hot-plugging of devices, it doesn't handle tagged
queueing very well, it hasn't the slightest idea about multipath (which
SATA2 can do), and it doesn't know a lot of other things either.
SATA and especially SATA2 is basically SCSI with some slightly odd ways
of issuing READ10/WRITE10 to disk devices. A "native" driver would
basically be a copy of most of drivers/scsi.
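For reference, the "SCSI-ness" involved in a plain disk read really is
just filling in a 10-byte CDB; a sketch, with a made-up helper name:

/* Standard SBC READ(10) CDB layout. */
#include <stdint.h>
#include <string.h>

static void build_read10(uint8_t cdb[10], uint32_t lba, uint16_t nblocks)
{
	memset(cdb, 0, 10);
	cdb[0] = 0x28;		/* READ(10) opcode */
	cdb[2] = lba >> 24;	/* logical block address, big-endian */
	cdb[3] = lba >> 16;
	cdb[4] = lba >> 8;
	cdb[5] = lba;
	cdb[7] = nblocks >> 8;	/* transfer length in blocks */
	cdb[8] = nblocks;
}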
I actually think that's a positive thing. It means 2.5 drivers/scsi is
now very close to being the "native queueing driver" with some
additional default plugins for doing scsi scanning, scsi error recovery
and a few other scsi bits.
On Llu, 2003-05-26 at 21:09, Linus Torvalds wrote:
> My point is that it's _wrong_ to make non-SCSI drivers use the SCSI layer,
> because that shows that something is misdesigned.
Sure - there is lots of stuff (error handling, timers, requeue, ioctls)
that ought to be generic but right now happens to live in the scsi
layer, and a megaton of user space that only supports it.
> You adding more "pseudo-SCSI" crap just _makes_things_worse. It does not
> advance anything, it regresses.
SATA is SCSI with some legacy crap nailed on the back end; next
generation SATA controllers look like scsi, act like scsi, think like
scsi. The bang-a-register, count-to-ten, poke-a-couple-of-bits,
babysit-the-command world of IDE is dying.
Also, with regard to the confusion comment, several vendors now have
identical hardware interfaces for their smart SATA and SCSI devices.
dpt_i2o has an ATA form and aacraid has a SATA form, so the confusion
problem about device names also won't be going away in a hurry.
On 27 May 2003, Alan Cox wrote:
>
> I actually think that's a positive thing. It means 2.5 drivers/scsi is
> now very close to being the "native queueing driver" with some
> additional default plugins for doing scsi scanning, scsi error recovery
> and a few other scsi bits.
Hey, that may well be the way to go, in which case the core stuff should
be renamed and moved off somewhere else. Leaving drivers/scsi with just
the actual low-level SCSI drivers.
Linus
Linus Torvalds wrote:
> On Mon, 26 May 2003, Jeff Garzik wrote:
>
>>Correct, but precisely by saying that, you're missing something.
>
>
> You're missing _my_ point.
>
>
>>The SCSI midlayer provides infrastructure I need -- which is not
>>specific to SCSI at all.
>
>
> If it isn't specific to SCSI, then it sure as hell shouldn't BE THERE!
>
> My point is that it's _wrong_ to make non-SCSI drivers use the SCSI layer,
> because that shows that something is misdesigned.
>
> And I bet there isn't all that much left that really helps.
>
> You adding more "pseudo-SCSI" crap just _makes_things_worse. It does not
> advance anything, it regresses.
As you see from Alan's message and others, it isn't pseudo-SCSI.
Besides what he mentioned, there is Serial Attached SCSI (SAS), where a
host controller can simultaneously support SAS disks and SATA disks. So
it's either an IDE driver that does SCSI, or a SCSI driver that does
IDE, or a driver that's in both IDE and SCSI subsystems, or... ? Having
fun yet? :)
> On 27 May 2003, Alan Cox wrote:
>>> I actually think that's a positive thing. It means 2.5 drivers/scsi is
>>> now very close to being the "native queueing driver" with some
>>> additional default plugins for doing scsi scanning, scsi error recovery
>>> and a few other scsi bits.
>
> Hey, that may well be the way to go, in which case the core stuff should
> be renamed and moved off somewhere else. Leaving drivers/scsi with just
> the actual low-level SCSI drivers.
For all these reasons, I continue to maintain that starting out as a
SCSI driver, and then evolving, is the best route. The non-SCSI parts
leave drivers/scsi, as they should. The SCSI parts stay. The SCSI
mid-layer gets smaller. All the while, the driver continues to work.
Everybody wins.
Starting out as a native block driver and _then_ adding SCSI support and
native queueing and jazz does not sound even remotely like a good path
to follow.
Jeff
On Tue, 27 May 2003, Jeff Garzik wrote:
>
> As you see from Alan's message and others, it isn't pseudo-SCSI.
It _is_ pseudo-scsi.
Or rather, what used to be SCSI is quickly becoming irrelevant. There's
almost nothing left, except for the command set. And SCSI is a lot more
than the command set, it's the full definition of the signalling from
command set to electricals to connectors.
And the other stuff matters. The linux SCSI layer proper is full of the
_addressing_ that is part of the SCSI standard proper, and that is pretty
much total nonsense outside of that standard (it's starting to be nonsense
even inside that standard, since everybody is running away from the old buses
and the old addressing).
So we shouldn't call it SCSI, because it clearly IS NOT, whatever you
claim. This is a _fact_, I don't see why you argue against it. SCSI has a
well-defined definition (or rather, a _set_ of definitions), and SATA
ain't there. One is T10, the other is part of T13.
And quite frankly, names matter, and calling it SCSI is clearly wrong.
What makes you _think_ it is SCSI is that everybody uses the command set,
and all devices are starting to largely just talk MMC-2+.
But calling it MMC-2 is also incorrect, since everybody really talks a
superset, and we should just accept that and not try to limit ourselves.
Linus
On Mon, May 26 2003, Linus Torvalds wrote:
>
> On Mon, 26 May 2003, Jens Axboe wrote:
> >
> > > Think of all the fairness issues we've had in the elevator code, and
> > > realize that the low-level disk probably implements _none_ of those
> > > fairness algorithms.
> >
> > I think it does, to some extent at least.
>
> I doubt they do a very good job of it. I know of bad cases, even with
> "high-end" hardware. Sure, we can hope that it's getting better, but do we
> want to bet on it?
No I agree, it's always nice to handle these cases in software so we
don't have to rely on these things getting fixed.
> > > Hmm.. Where does it keep track of request latency for requests that have
> > > been removed from the queue?
> >
> > Well, it doesn't...
>
> Yeah. Which means that right now _really_ long starvation will show up as
> timeouts, while other cases will just show up as bad latency.
>
> Which will _work_, of course (assuming the timeout handling is correct,
> which is a big if in itself), but it still sucks from a usability
> standpoint.
>
> Even if we drop our timeouts from 30 seconds (or whatever they are now)
> down to just a few seconds, that's a _loooong_ time, and we should be a
> lot more proactive about things. Audio/video stuff tends to want things
> with latencies in the tenth-of-a-second range, even when they buffer
> things up internally to hide the worst cases.
Here's something ridiculously simple that just won't start a new tag if
the oldest tag is older than 100ms. Clearly nothing for submission, but
it gets the point across.
Now only look at reads, and we've got something a little useful at
least.
James, speaking of queue localities and tcq... Doug mentioned some time
ago that aic7xxx dishes out tag numbers from an hba pool, which makes it
impossible to support with our current block layer queueing code. Maybe
if we associate the blk_queue_tag structure with a bunch of queues
instead of having a 1:1 mapping it could work.
===== drivers/block/ll_rw_blk.c 1.170 vs edited =====
--- 1.170/drivers/block/ll_rw_blk.c Thu May 8 11:30:11 2003
+++ edited/drivers/block/ll_rw_blk.c Tue May 27 08:44:44 2003
@@ -574,6 +574,13 @@
BUG();
}
+ if (!list_empty(&bqt->busy_list)) {
+ struct request *__rq = list_entry_rq(bqt->busy_list.prev);
+
+ if (time_after(jiffies, __rq->timeout))
+ return 1;
+ }
+
for (map = bqt->tag_map; *map == -1UL; map++) {
tag += BLK_TAGS_PER_LONG;
@@ -589,6 +596,7 @@
bqt->tag_index[tag] = rq;
blkdev_dequeue_request(rq);
list_add(&rq->queuelist, &bqt->busy_list);
+ rq->timeout = jiffies + HZ / 10;
bqt->busy++;
return 0;
}
--
Jens Axboe
On Mon, 26 May 2003, Linus Torvalds wrote:
>
> And quite frankly, names matter, and calling it SCSI is clearly wrong.
Btw, in case you wonder why I care about names and organization, it's
because with the names and organization comes assumptions and
expectations.
One prime example of this is cdrecord, and the incredible braindamage that
the name "SCSI" foisted upon it. Why? Because everybody (ie schily)
_knows_ that SCSI is addressed by bus/id/lun, and thinks that anything
else is wrong. So you have total idiocies like the "cdrecord -scanbus"
crap for finding your device, and totally useless naming that makes no
sense in any sane environment.
Calling something SCSI when it isn't brings on these kinds of bad things:
people make assumptions that aren't sensible or desirable.
Names have power. There's baggage and assumptions in a name. In the case
of SCSI, there is a _lot_ of baggage.
Linus
Linus Torvalds wrote:
> One prime example of this is cdrecord, and the incredible braindamage that
> the name "SCSI" foisted upon it. Why? Because everybody (ie schily)
> _knows_ that SCSI is addressed by bus/id/lun, and thinks that anything
> else is wrong. So you have total idiocies like the "cdrecord -scanbus"
> crap for finding your device, and totally useless naming that makes no
> sense in any sane environment.
>
> Calling something SCSI when it isn't brings on these kinds of bad things:
> people make assumptions that aren't sensible or desirable.
>
> Names have power. There's baggage and assumptions in a name. In the case
> of SCSI, there is a _lot_ of baggage.
Now that argument I can buy.
There are still helper functions to be created before a native block
driver can directly use struct requests for fully native queueing.
Brand new device, host registration code. PM, hotplug, yadda yadda. It
winds up being a lot of code still, and it's not as simple as you and Jens
seem to be making the task out to be. That's why I brought up
/dev/{disk,floppy,cdrom}...
If all that work is to be done for a brand new, native block driver, we
should at least intend to use the code as a bus-agnostic command
transport layer, with packages of helpers like my current "libata" doing
the command set work (and sometimes, some amount of low-level driver
work, where commonality exists).
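The rough shape such a transport layer might export (all of this is
entirely hypothetical; it is not the libata API, just an illustration of
the split):

struct xport_device;

struct xport_ops {
	int  (*queue_cmd)(struct xport_device *dev, void *cmd, unsigned int len);
	void (*complete_cmd)(struct xport_device *dev, void *cmd, int status);
	int  (*error_handler)(struct xport_device *dev);
};

/* A command-set helper library (ATA taskfiles, SCSI CDBs, ...) would sit
 * on top of this, translating struct request into whatever "cmd" means
 * for that particular bus. */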
Jeff
On Mon, May 26 2003, Jens Axboe wrote:
> On Mon, May 26 2003, James Bottomley wrote:
> > On Mon, 2003-05-26 at 15:07, Jens Axboe wrote:
> > > Alright, so what do you need? Start out with X tags, shrink to Y (based
> > > on repeated queue full conditions)? Anything else?
> >
> > Actually, it's easier than that: just an API to alter the number of tags
> > in the block layer (really only the size of your internal hash table).
> > The actual heuristics of when to alter the queue depth is the province
> > of the individual drivers (although Doug Ledford was going to come up
> > with a generic implementation).
>
> That's actually what I meant, that the SCSI layer would call down into
> the block layer to set the size. I don't/want to know about queue full
> conditions.
>
> The internal memory requirements for the queue table is small (a bit per
> tag), so I think we can basically get away with just decrementing
> ->max_depth.
James, something like this would be enough then (untested, compiles)?
===== drivers/block/ll_rw_blk.c 1.170 vs edited =====
--- 1.170/drivers/block/ll_rw_blk.c Thu May 8 11:30:11 2003
+++ edited/drivers/block/ll_rw_blk.c Tue May 27 14:37:20 2003
@@ -413,11 +413,12 @@
{
struct blk_queue_tag *bqt = q->queue_tags;
- if (unlikely(bqt == NULL || bqt->max_depth < tag))
+ if (unlikely(bqt == NULL))
return NULL;
return bqt->tag_index[tag];
}
+
/**
* blk_queue_free_tags - release tag maintenance info
* @q: the request queue for the device
@@ -448,38 +449,26 @@
q->queue_flags &= ~(1 << QUEUE_FLAG_QUEUED);
}
-/**
- * blk_queue_init_tags - initialize the queue tag info
- * @q: the request queue for the device
- * @depth: the maximum queue depth supported
- **/
-int blk_queue_init_tags(request_queue_t *q, int depth)
+static int init_tag_map(struct blk_queue_tag *tags, int depth)
{
- struct blk_queue_tag *tags;
int bits, i;
if (depth > (queue_nr_requests*2)) {
depth = (queue_nr_requests*2);
- printk("blk_queue_init_tags: adjusted depth to %d\n", depth);
+ printk(KERN_ERR "%s: adjusted depth to %d\n", __FUNCTION__, depth);
}
- tags = kmalloc(sizeof(struct blk_queue_tag),GFP_ATOMIC);
- if (!tags)
- goto fail;
-
tags->tag_index = kmalloc(depth * sizeof(struct request *), GFP_ATOMIC);
if (!tags->tag_index)
- goto fail_index;
+ goto fail;
bits = (depth / BLK_TAGS_PER_LONG) + 1;
tags->tag_map = kmalloc(bits * sizeof(unsigned long), GFP_ATOMIC);
if (!tags->tag_map)
- goto fail_map;
+ goto fail;
memset(tags->tag_index, 0, depth * sizeof(struct request *));
memset(tags->tag_map, 0, bits * sizeof(unsigned long));
- INIT_LIST_HEAD(&tags->busy_list);
- tags->busy = 0;
tags->max_depth = depth;
/*
@@ -488,22 +477,89 @@
for (i = depth; i < bits * BLK_TAGS_PER_LONG; i++)
__set_bit(i, tags->tag_map);
+ return 0;
+fail:
+ kfree(tags->tag_index);
+ return -ENOMEM;
+}
+
+
+/**
+ * blk_queue_init_tags - initialize the queue tag info
+ * @q: the request queue for the device
+ * @depth: the maximum queue depth supported
+ **/
+int blk_queue_init_tags(request_queue_t *q, int depth)
+{
+ struct blk_queue_tag *tags;
+
+ tags = kmalloc(sizeof(struct blk_queue_tag),GFP_ATOMIC);
+ if (!tags)
+ goto fail;
+
+ if (init_tag_map(tags, depth))
+ goto fail;
+
+ INIT_LIST_HEAD(&tags->busy_list);
+ tags->busy = 0;
+
/*
* assign it, all done
*/
q->queue_tags = tags;
q->queue_flags |= (1 << QUEUE_FLAG_QUEUED);
return 0;
-
-fail_map:
- kfree(tags->tag_index);
-fail_index:
- kfree(tags);
fail:
+ kfree(tags);
return -ENOMEM;
}
/**
+ * blk_queue_resize_tags - change the queueing depth
+ * @q: the request queue for the device
+ * @new_depth: the new max command queueing depth
+ *
+ * Notes:
+ * Must be called with the queue lock held.
+ **/
+int blk_queue_resize_tags(request_queue_t *q, int new_depth)
+{
+ struct blk_queue_tag *bqt = q->queue_tags;
+ struct request **tag_index;
+ unsigned long *tag_map;
+ int bits, max_depth;
+
+ if (!bqt)
+ return -ENXIO;
+
+ /*
+ * don't bother sizing down
+ */
+ if (new_depth <= bqt->max_depth) {
+ bqt->max_depth = new_depth;
+ return 0;
+ }
+
+ /*
+ * save the old state info, so we can copy it back
+ */
+ tag_index = bqt->tag_index;
+ tag_map = bqt->tag_map;
+ max_depth = bqt->max_depth;
+
+ if (init_tag_map(bqt, new_depth))
+ return -ENOMEM;
+
+ memcpy(bqt->tag_index, tag_index, max_depth * sizeof(struct request *));
+ bits = (max_depth / BLK_TAGS_PER_LONG) + 1;
+ memcpy(bqt->tag_map, tag_map, bits * sizeof(unsigned long));
+
+ kfree(tag_index);
+ kfree(tag_map);
+ return 0;
+}
+
+/**
* blk_queue_end_tag - end tag operations for a request
* @q: the request queue for the device
* @tag: the tag that has completed
@@ -523,9 +579,6 @@
int tag = rq->tag;
BUG_ON(tag == -1);
-
- if (unlikely(tag >= bqt->max_depth))
- return;
if (unlikely(!__test_and_clear_bit(tag, bqt->tag_map))) {
printk("attempt to clear non-busy tag (%d)\n", tag);
===== include/linux/blkdev.h 1.105 vs edited =====
--- 1.105/include/linux/blkdev.h Thu May 8 11:30:11 2003
+++ edited/include/linux/blkdev.h Tue May 27 12:36:53 2003
@@ -452,6 +452,7 @@
extern void blk_queue_end_tag(request_queue_t *, struct request *);
extern int blk_queue_init_tags(request_queue_t *, int);
extern void blk_queue_free_tags(request_queue_t *);
+extern int blk_queue_resize_tags(request_queue_t *, int);
extern void blk_queue_invalidate_tags(request_queue_t *);
extern void blk_congestion_wait(int rw, long timeout);
--
Jens Axboe
On Tue, 2003-05-27 at 02:54, Jens Axboe wrote:
> James, speaking of queue localities and tcq... Doug mentioned some time
> ago that aic7xxx dishes out tag numbers from an hba pool, which makes it
> impossible to support with our current block layer queueing code. Maybe
> if we associate the blk_queue_tag structure with a bunch of queues
> instead of having a 1:1 mapping it could work.
Yes, a large number of devices (not just the aic7xxx) have a single
issue queue for all outgoing requests (this is the reason for the
can_queue limit in the host template).
If the issue queue is < 256 slots, it makes a lot of sense to use global
tag numbers instead of device local ones, since then you can simply map
the returning tag number to an index in the issue queue and not bother
having to keep a hash table of <pun, lun, tag> to look it up in.
Even drivers which have larger or variable size issue queues (here the
qlogic ones spring immediately to mind) often have a virtual index
number attached to their commands (which, this time is 16 bits) which
they'd like to use as a unique offset into the issue queue. Obviously,
they face exactly the same challenges as tags and they'd like a similar
solution.
A global structure for a bunch of queues would probably be the most
useful way forwards.
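Sketched out (invented names, not the blk_queue_tag code), sharing would
look roughly like:

/* One tag map, many queues. */
struct toy_request;

struct toy_shared_tags {
	unsigned long *tag_map;		/* one bit per HBA-wide tag */
	struct toy_request **tag_index;	/* tag number -> request */
	int max_depth;
	int users;			/* queues sharing this map */
};

struct toy_queue {
	struct toy_shared_tags *tags;	/* several queues point at one map */
};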
James
On Tue, 2003-05-27 at 08:39, Jens Axboe wrote:
> James, something like this would be enough then (untested, compiles)?
Yes...the only concern I would have is that if you downsize the tag map,
you don't seem to keep any memory of what the high water mark actually
is. Thus, I can drop the depth by 8 and then increase it again by 4 and
you won't see that the increase can be accommodated by the already
allocated space.
James
On Tue, 27 May 2003, Jens Axboe wrote:
>
> Here's something ridiculously simple that just won't start a new tag if
> the oldest tag is older than 100ms. Clearly nothing for submission, but
> it gets the point across.
Yes, I think something like this should work very well.
In fact, it might fit in very well indeed with AS - and in general it
might be a good idea to have some nice interface for the IO scheduler to
give this kind of ordering hints down to the hardware.
Linus
On Tue, 2003-05-27 at 10:36, Linus Torvalds wrote:
> Btw, in case you wonder why I care about names and organization, it's
> because with the names and organization comes assumptions and
> expectations.
> One prime example of this is cdrecord, and the incredible braindamage that
> the name "SCSI" foisted upon it. Why? Because everybody (ie schily)
> _knows_ that SCSI is addressed by bus/id/lun, and thinks that anything
> else is wrong. So you have total idiocies like the "cdrecord -scanbus"
> crap for finding your device, and totally useless naming that makes no
> sense in any sane environment.
> Calling something SCSI when it isn't brings on these kinds of bad things:
> people make assumptions that aren't sensible or desirable.
> Names have power. There's baggage and assumptions in a name. In the case
> of SCSI, there is a _lot_ of baggage.
I took this one on board a long time ago. Even in the SCSI world, FC
devices don't think in terms of PUN, they think in terms of WWN.
If you look at the mid layer (and I don't promise this to be complete
yet) we're moving away from referring to things by host/channel/id/lun.
Now we just have host/device list in most places. About the only place
we convert back to the numbers is to print messages.
We're certainly not there yet, since we need to support legacy
interfaces like /proc/scsi/scsi. But eventually you'll probably see us
using the sysfs name instead of the id (FC devices will probably stuff
WWNs in here, other things may use numbers) and lun (not sure how we'll
represent SCSI-3 LUN hierarchies yet). Hopefully, it will be possible
to make the mid layer entirely unaware of any id/lun distinction so it
could be configured for a flatter host/device space instead.
James
On Tue, May 27, 2003 at 10:59:51AM -0400, James Bottomley wrote:
> We're certainly not there yet, since we need to support legacy
> interfaces like /proc/scsi/scsi. But eventually you'll probably see us
> using the sysfs name instead of the id (FC devices will probably stuff
> WWNs in here, other things may use numbers) and lun (not sure how we'll
> represent SCSI-3 LUN hierarchies yet). Hopefully, it will be possible
> to make the mid layer entirely unaware of any id/lun distinction so it
> could be configured for a flatter host/device space instead.
ATA defines WWN too... I'm curious what the format is? uuid-ish?
Jeff
On Tue, 2003-05-27 at 11:21, Jeff Garzik wrote:
> ATA defines WWN too... I'm curious what the format is? uuid-ish?
Heh, you'll regret asking that one. There are currently seven possible
identifier formats for the WWN (which is basically the SCSI Device
Identification VPD page).
The extremely gory details are in the SPC-3 spec (section 7.6.4 of r12).
However, really, as far as Linux is concerned we don't need to care. It
just needs to be reducible to an ASCII representation and dumped into
the sysfs name or embedded into bus_id. The reduction to ASCII can be
subsystem (or even device driver) specific.
However, note that I'm not thinking of forcing the WWN here. I'm
thinking of using whatever makes most sense to the device, so SPI
devices will continue to use simple target/lun numbers here. FC devices
will probably want to use their variant of WWN/PortID. The rationale is
to get rid of the unnecessary internal mappings some drivers use to get
to the physical address they send on the wire.
Thus, if you never address ATA devices by the WWN, you probably never
want to make it part of the addressing scheme.
Exporting a unique ID for userspace to use is a different (and probably
orthogonal) issue.
James
On Tue, May 27, 2003 at 11:38:50AM -0400, James Bottomley wrote:
> Thus, if you never address ATA devices by the WWN, you probably never
> want to make it part of the addressing scheme.
>
> Exporting a unique ID for userspace to use is a different (and probably
> orthogonal) issue.
Oh, no question. My main interest is having a persistent id for a
device's media. Then Linux can use that to allow mapping in case device
names or majors change at each boot (and similar situations).
Jeff
On Tue, 2003-05-27 at 11:50, Jeff Garzik wrote:
> Oh, no question. My main interest is having a persistent id for a
> device's media. Then Linux can use that to allow mapping in case device
> names or majors change at each boot (and similar situations).
Well, OK, that's not an in-kernel issue.
My current thought for this is that deriving the unique name can be
awfully device specific (even in SCSI there are several fallback methods
plus the usual black/white list of things that don't quite return
entirely unique objects). Thus, I think the derivation probably belongs
in userspace as part of hotplug. Once the unique ID is determined, it
could be written to a well known place in the sysfs tree for all the
rest of the OS (things like udev for persistent device naming) to use.
James
On Tue, May 27, 2003 at 12:00:28PM -0400, James Bottomley wrote:
> On Tue, 2003-05-27 at 11:50, Jeff Garzik wrote:
> > Oh, no question. My main interest is having a persistent id for a
> > device's media. Then Linux can use that to allow mapping in case device
> > names or majors change at each boot (and similar situations).
>
> Well, OK, that's not an in-kernel issue.
It's definitely an in-kernel issue, because the mapping is in-kernel now:
<major,minor> -> queue
> My current thought for this is that deriving the unique name can be
> awfully device specific (even in SCSI there are several fallback methods
> plus the usual black/white list of things that don't quite return
> entirely unique objects).
Agreed. And kernel assistance is likely necessary in many cases to
obtain the unique id, even if it's only userspace that's reading it.
> Thus, I think the derivation probably belongs
> in userspace as part of hotplug. Once the unique ID is determined, it
> could be written to a well known place in the sysfs tree for all the
> rest of the OS (things like udev for persistent device naming) to use.
Agreed, with the above proviso.
Jeff
On Tue, May 27 2003, James Bottomley wrote:
> On Tue, 2003-05-27 at 08:39, Jens Axboe wrote:
> > James, something like this would be enough then (untested, compiles)?
>
> Yes...the only concern I would have is that if you downsize the tag map,
> you don't seem to keep any memory of what the high water mark actually
> is. Thus, I can drop the depth by 8 and then increase it again by 4 and
> you won't see that the increase can be accommodated by the already
> allocated space.
If you increase it again, the maps are resized. Is that a problem? Seems
ok to me.
--
Jens Axboe
On Tue, 2003-05-27 at 13:16, Jens Axboe wrote:
> If you increase it again, the maps are resized. Is that a problem? Seems
> ok to me.
What I mean is that you allocate memory whenever the depth increases,
even if you already have an array large enough to accommodate the
increase (because you don't release it when you decrease the tag depth).
On further examination, there's also an invalid tag race: If a device
is throttling, it might want to do a big decrease followed fairly
quickly by a small increase. When it does the increase, you potentially
still have outstanding tags above the new depth, which will now run off
the end of your newly allocated tag array.
James
On Tue, May 27 2003, James Bottomley wrote:
> On Tue, 2003-05-27 at 13:16, Jens Axboe wrote:
> > If you increase it again, the maps are resized. Is that a problem? Seems
> > ok to me.
>
> What I mean is that you allocate memory whenever the depth increases.
> Even if you have an array large enough to accommodate the increase
> (because you don't release when you decrease the tag depth).
Yes, I know what you mean, but my question is whether it's worth keeping
track of. No memory is really lost, but we are of course doing a copy of
the tag map and bitmap in addition to the extra allocation. Given that
you're not going to change depths 100 times per second, I don't think
it's worth it to actually do anything about it, especially since you'll
quickly settle at the desired depth and stay there. But hey, I'm an
accommodating guy (pfft), so here's the change just for you :)
> On further examination, there's also an invalid tag race: If a device
> is throttling, it might want to do a big decrease followed fairly
> quickly by a small increase. When it does the increase, you potentially
> still have outstanding tags above the new depth, which will now run off
> the end of your newly allocated tag array.
Oh yes you are right. How does the attached look? With real_max_depth,
that should work as well since we'll only ever alloc a bigger area.
===== drivers/block/ll_rw_blk.c 1.170 vs edited =====
--- 1.170/drivers/block/ll_rw_blk.c Thu May 8 11:30:11 2003
+++ edited/drivers/block/ll_rw_blk.c Tue May 27 20:21:00 2003
@@ -413,11 +413,12 @@
{
struct blk_queue_tag *bqt = q->queue_tags;
- if (unlikely(bqt == NULL || bqt->max_depth < tag))
+ if (unlikely(bqt == NULL || tag >= bqt->real_max_depth))
return NULL;
return bqt->tag_index[tag];
}
+
/**
* blk_queue_free_tags - release tag maintenance info
* @q: the request queue for the device
@@ -448,39 +449,28 @@
q->queue_flags &= ~(1 << QUEUE_FLAG_QUEUED);
}
-/**
- * blk_queue_init_tags - initialize the queue tag info
- * @q: the request queue for the device
- * @depth: the maximum queue depth supported
- **/
-int blk_queue_init_tags(request_queue_t *q, int depth)
+static int init_tag_map(struct blk_queue_tag *tags, int depth)
{
- struct blk_queue_tag *tags;
int bits, i;
if (depth > (queue_nr_requests*2)) {
depth = (queue_nr_requests*2);
- printk("blk_queue_init_tags: adjusted depth to %d\n", depth);
+ printk(KERN_ERR "%s: adjusted depth to %d\n", __FUNCTION__, depth);
}
- tags = kmalloc(sizeof(struct blk_queue_tag),GFP_ATOMIC);
- if (!tags)
- goto fail;
-
tags->tag_index = kmalloc(depth * sizeof(struct request *), GFP_ATOMIC);
if (!tags->tag_index)
- goto fail_index;
+ goto fail;
bits = (depth / BLK_TAGS_PER_LONG) + 1;
tags->tag_map = kmalloc(bits * sizeof(unsigned long), GFP_ATOMIC);
if (!tags->tag_map)
- goto fail_map;
+ goto fail;
memset(tags->tag_index, 0, depth * sizeof(struct request *));
memset(tags->tag_map, 0, bits * sizeof(unsigned long));
- INIT_LIST_HEAD(&tags->busy_list);
- tags->busy = 0;
tags->max_depth = depth;
+ tags->real_max_depth = bits * BITS_PER_LONG;
/*
* set the upper bits if the depth isn't a multiple of the word size
@@ -488,22 +478,89 @@
for (i = depth; i < bits * BLK_TAGS_PER_LONG; i++)
__set_bit(i, tags->tag_map);
+ return 0;
+fail:
+ kfree(tags->tag_index);
+ return -ENOMEM;
+}
+
+
+/**
+ * blk_queue_init_tags - initialize the queue tag info
+ * @q: the request queue for the device
+ * @depth: the maximum queue depth supported
+ **/
+int blk_queue_init_tags(request_queue_t *q, int depth)
+{
+ struct blk_queue_tag *tags;
+
+ tags = kmalloc(sizeof(struct blk_queue_tag),GFP_ATOMIC);
+ if (!tags)
+ goto fail;
+
+ if (init_tag_map(tags, depth))
+ goto fail;
+
+ INIT_LIST_HEAD(&tags->busy_list);
+ tags->busy = 0;
+
/*
* assign it, all done
*/
q->queue_tags = tags;
q->queue_flags |= (1 << QUEUE_FLAG_QUEUED);
return 0;
-
-fail_map:
- kfree(tags->tag_index);
-fail_index:
- kfree(tags);
fail:
+ kfree(tags);
return -ENOMEM;
}
/**
+ * blk_queue_resize_tags - change the queueing depth
+ * @q: the request queue for the device
+ * @new_depth: the new max command queueing depth
+ *
+ * Notes:
+ * Must be called with the queue lock held.
+ **/
+int blk_queue_resize_tags(request_queue_t *q, int new_depth)
+{
+ struct blk_queue_tag *bqt = q->queue_tags;
+ struct request **tag_index;
+ unsigned long *tag_map;
+ int bits, max_depth;
+
+ if (!bqt)
+ return -ENXIO;
+
+ /*
+ * don't bother sizing down
+ */
+ if (new_depth <= bqt->real_max_depth) {
+ bqt->max_depth = new_depth;
+ return 0;
+ }
+
+ /*
+ * save the old state info, so we can copy it back
+ */
+ tag_index = bqt->tag_index;
+ tag_map = bqt->tag_map;
+ max_depth = bqt->real_max_depth;
+
+ if (init_tag_map(bqt, new_depth))
+ return -ENOMEM;
+
+ memcpy(bqt->tag_index, tag_index, max_depth * sizeof(struct request *));
+ bits = max_depth / BLK_TAGS_PER_LONG;
+ memcpy(bqt->tag_map, tag_map, bits * sizeof(unsigned long));
+
+ kfree(tag_index);
+ kfree(tag_map);
+ return 0;
+}
+
+/**
* blk_queue_end_tag - end tag operations for a request
* @q: the request queue for the device
* @tag: the tag that has completed
@@ -524,7 +581,7 @@
BUG_ON(tag == -1);
- if (unlikely(tag >= bqt->max_depth))
+ if (unlikely(tag >= bqt->real_max_depth))
return;
if (unlikely(!__test_and_clear_bit(tag, bqt->tag_map))) {
===== include/linux/blkdev.h 1.105 vs edited =====
--- 1.105/include/linux/blkdev.h Thu May 8 11:30:11 2003
+++ edited/include/linux/blkdev.h Tue May 27 20:15:31 2003
@@ -179,7 +179,8 @@
unsigned long *tag_map; /* bit map of free/busy tags */
struct list_head busy_list; /* fifo list of busy tags */
int busy; /* current depth */
- int max_depth;
+ int max_depth; /* what we will send to device */
+ int real_max_depth; /* what the array can hold */
};
struct request_queue
@@ -452,6 +453,7 @@
extern void blk_queue_end_tag(request_queue_t *, struct request *);
extern int blk_queue_init_tags(request_queue_t *, int);
extern void blk_queue_free_tags(request_queue_t *);
+extern int blk_queue_resize_tags(request_queue_t *, int);
extern void blk_queue_invalidate_tags(request_queue_t *);
extern void blk_congestion_wait(int rw, long timeout);
--
Jens Axboe
On Tue, 2003-05-27 at 14:21, Jens Axboe wrote:
> Oh yes you are right. How does the attached look? With real_max_depth,
> that should work as well since we'll only ever alloc a bigger area.
That looks perfect. You submit the block layer changes, I'll plumb it
into SCSI and we can try it out....just as soon as I convert the one
SCSI driver that uses the block queue tags to do dynamic queue
adjustment...
James
On Tue, May 27 2003, Linus Torvalds wrote:
>
> On Tue, 27 May 2003, Jens Axboe wrote:
> >
> > Here's something ridiculously simple that just won't start a new tag if
> > the oldest tag is older than 100ms. Clearly nothing for submission, but
> > it gets the point across.
>
> Yes, I think something like this should work very well.
Agree, it should take the edge off crappy hardware at least.
> In fact, it might fit in very well indeed with AS - and in general it
> might be a good idea to have some nice interface for the IO scheduler to
> give this kind of ordering hints down to the hardware.
And deadline, they share the same request expire mechanism. But I read
your hint, I'll add the hint and fix this for real. Was waiting for the
other tcq patch to be committed as they overlap, but I see that is in
so...
--
Jens Axboe
On Tue, May 27, 2003 at 12:16:24PM -0400, Jeff Garzik wrote:
> > Well, OK, that's not an in-kernel issue.
>
> It's definitely an in-kernel issue, because the mapping is in-kernel now:
>
> <major,minor> -> queue
The mapping in the kernel is gendisk -> queue. On the syscall
boundary we have an additional dev_t -> gendisk mapping. The
scsi midlayer e.g. doesn't care about dev_t or even major / minor
at all.
At 11:21 AM 27/05/2003 -0400, Jeff Garzik wrote:
>ATA defines WWN too... I'm curious what the format is? uuid-ish?
FC uses a 64-bit identifier for a WWN.
you end up with things like "50:00:0e:10:00:00:01:1c".
ironically, FC switching doesn't actually 'use' the WWN but a 24-bit FC_ID.
for things like mappings -- who knows -- perhaps WWN is good -- perhaps
something like a serial # is also good -- perhaps both are good?
mel-stglab-mds9509-1# show scsi-target lun vsan 1
- DF350F from HITACHI (Rev 0000)
FCID is 0x7f0002 in VSAN 1, PWWN is 50:00:0e:10:00:00:01:1c
------------------------------------------------------------------------------
LUN Capacity Status Serial Number Device-Id
(MB)
------------------------------------------------------------------------------
0x0 16651 Online C:2 A:0 T:1 HITACHI
D35067950000
0x1 16651 Online C:2 A:0 T:1 HITACHI
D35067950001
- ST318452FC from HP (Rev HP03)
FCID is 0x7f01da in VSAN 1, PWWN is 22:00:00:04:cf:8c:1f:95
------------------------------------------------------------------------------
LUN Capacity Status Serial Number Device-Id
(MB)
------------------------------------------------------------------------------
0x0 17920 Online 3EV0JVQJ C:1 A:0 T:3
20:00:00:04:cf:8c:1f:95
[..]
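printing such an identifier is trivial, for what it's worth (standalone
example; the value is just the DF350F PWWN from the listing above):

#include <stdio.h>
#include <stdint.h>

static void print_wwn(uint64_t wwn)
{
	for (int i = 7; i >= 0; i--)
		printf("%02x%s", (unsigned int)(wwn >> (i * 8)) & 0xff,
		       i ? ":" : "\n");
}

int main(void)
{
	print_wwn(0x50000e100000011cULL);	/* 50:00:0e:10:00:00:01:1c */
	return 0;
}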
cheers,
lincoln.
JG,
This is the right move after months of debating ideas.
Migrating the contents of ./drivers/ide/pci to scsi is reasonable too.
This will allow the legacy drivers to die.
Taskfile came too late to be really useful; it is now best used as a model
to tackle FIS/FPDMA mapping for SATA II.
Linus, my professional opinion is to follow Jeff's direction for 2.5/2.6.
This will allow Linux to push open source to the hardware vendors.
There are several bastardized scsi-ide drivers in ./scsi.
pci2000.c,h :: pci2220i.c,h :: psi240i.c,h + psi_*.h :: eata*
ATAPI is nothing more than Bastardized SCSI beaten with an ugly stick.
Packet-Taskfile is more than painful, and I am not in the fighting mood.
Cheers,
Andre
On Mon, 26 May 2003, Jeff Garzik wrote:
> Just to echo some comments I said in private, this driver is _not_
> a replacement for drivers/ide. This is not, and has never been,
> the intention. In fact, I need drivers/ide's continued existence,
> so that I may have fewer boundaries on future development.
>
> Even though ATAPI support doesn't exist and error handling is
> primitive, this driver has been extensively tested locally and I feel
> is ready for a full and public kernel developer assault :)
>
> James ok'd sending this... I'll be sending "un-hack scsi headers" patch
> through him via his scsi-misc-2.5 tree.
>
Andre Hedrick
LAD Storage Consulting Group
On Llu, 2003-06-02 at 10:46, Andre Hedrick wrote:
> Linus, my professional opinion is to follow Jeff's direction for 2.5/2.6.
> This will allow Linux to push open source to the hardware vendors.
> There are several bastardized scsi-ide drivers in ./scsi.
>
> pci2000.c,h :: pci2220i.c,h :: psi240i.c,h + psi_*.h :: eata*
megaraid, i2o, 3ware, aacraid all also drive IDE devices now.
Promise's new driver is using the scsi layer as well (because they do
command queuing as well as drive tagged queue stuff). I've been
hacking on the SI stuff a bit and it's also apparent our IDE PATA
layer won't support that well in a hurry either, since it also
has command buffering.