Background:
We have a large number of devices in scsi: aacraid, aic7xxx, qla1280,
qla2xxx which can all do full 64 bit DMA, but which pay a performance
penalty for using the larger descriptors (aic7xxx is stranger in that it
has three modes of operation: 32 bit, 39 bit and 64 bit each with an
increasing performance penalty).
What all these devices would like, instead of simply trying the 64 bit
mask first and having the platform accept it even if it only has < 4GB
of memory, is to have the platform tell them: "given my current dma
mask setting, what is the actual number of bits you need me to DMA to?"
This is precisely what the API would do. It would return a bit mask
(never over the current dma_mask) that the platform considers optimal.
The platform has complete freedom in this: it may return a mask covering
the total physical memory, or a mask covering all the bits it needs
setting for some weird NUMA scheme.
If the driver decides to use the mask, it would do another
dma_set_mask() to confirm it (this gives the platform the opportunity if
it so chooses to return a mask that doesn't quite cover memory, but
would be more optimal...say for platforms that have all memory under 4GB
bar one small chunk at 64GB or something).
Once the driver has the platform's optimal mask, it can use this to
decide on the correct descriptor size.
Comments?
James
JB> Once the driver has the platform's optimal mask, it can use this to
JB> decide on the correct descriptor size.
JB>
JB> Comments?
Would it break with memory hotplug?
--
Meelis Roos
As I said in the other thread, if the SG list were built from
ascending-order page allocations instead of descending-order ones, we
would see a major reduction in the number of SG elements and thus would
not have to make this performance tradeoff. That does not make the
request moot; it is still worth doing for other optimizations, such as
preventing DAC cycles whose upper 32 bits are all zero, but the real
performance hit for the aacraid driver is the sheer quantity of
larger-sized SG elements.
Currently, large requests come through as follows:
SG#   Size   Phys
  0   4096   ef400000
  1   4096   ef3ff000
  2   4096   ef3fe000
  3   4096   ef3fd000
  .    .     .
Another firmware change, which alters the interface to deal with this
problem, allows us to negotiate a larger FIB (adapter command packet)
size to accommodate a huge number of SG elements. We can dodge this
bullet as well on Linux if this SG element allocation issue is
resolved.
FYI, when we allow 4MB I/O to occur with this new larger FIB size,
system performance drops to its knees: mouse movement is sluggish, etc.
There appear to be some scaling issues that I have yet to understand or
characterize to root cause.
Sincerely -- Mark Salyzyn
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of James Bottomley
Sent: Thursday, June 17, 2004 10:09 AM
To: Linux Kernel; SCSI Mailing List
Subject: Proposal for new generic device API: dma_get_required_mask()
Background:
We have a large number of devices in scsi: aacraid, aic7xxx, qla1280,
qla2xxx which can all do full 64 bit DMA, but which pay a performance
penalty for using the larger descriptors (aic7xxx is stranger in that it
has three modes of operation: 32 bit, 39 bit and 64 bit each with an
increasing performance penalty).
What all these devices would like to do is instead of simply trying the
64 bit mask first and having the platform accept it, even if it only has
< 4GB of memory, they'd like to be able to have the platform tell them,
given my current dma mask setting, what's the actual number of bits you
need me to DMA to.
This is precisely what the API would do. It would return a bit mask
(never over the current dma_mask) that the platform considers optimal.
The platform has complete freedom in this: it may return a mask covering
the total physical memory, or a mask covering all the bits it needs
setting for some weird numa scheme.
If the driver decides to use the mask, it would do another
dma_set_mask() to confirm it (this gives the platform the opportunity if
it so chooses to return a mask that doesn't quite cover memory, but
would be more optimal...say for platforms that have all memory under 4GB
bar one small chunk at 64GB or something).
Once the driver has the platform's optimal mask, it can use this to
decide on the correct descriptor size.
Comments?
James
On Thu, Jun 17, 2004 at 09:08:51AM -0500, James Bottomley wrote:
> We have a large number of devices in scsi: aacraid, aic7xxx, qla1280,
> qla2xxx which can all do full 64 bit DMA, but which pay a performance
> penalty for using the larger descriptors (aic7xxx is stranger in that it
> has three modes of operation: 32 bit, 39 bit and 64 bit each with an
> increasing performance penalty).
sym2 too -- 32, 40 and 64-bit, but with the restriction that 64-bit
mode is limited to accessing 16 4GB segments.
> Once the driver has the platform's optimal mask, it can use this to
> decide on the correct descriptor size.
Sounds good to me.
So let's look at the sym2 implementation to see how this would work.
A straight diff -u is not terribly helpful, so let's try a more verbose
context diff:
*** 1625,1657 ****
   */
  static int sym_setup_bus_dma_mask(struct sym_hcb *np)
  {
! #if SYM_CONF_DMA_ADDRESSING_MODE == 0
! 	if (pci_set_dma_mask(np->s.device, 0xffffffffUL))
! 		goto out_err32;
! #else
! #if SYM_CONF_DMA_ADDRESSING_MODE == 1
! #define PciDmaMask	0xffffffffffULL
! #elif SYM_CONF_DMA_ADDRESSING_MODE == 2
! #define PciDmaMask	0xffffffffffffffffULL
! #endif
  	if (np->features & FE_DAC) {
! 		if (!pci_set_dma_mask(np->s.device, PciDmaMask)) {
! 			np->use_dac = 1;
! 			printf_info("%s: using 64 bit DMA addressing\n",
! 				sym_name(np));
! 		} else {
! 			if (pci_set_dma_mask(np->s.device, 0xffffffffUL))
! 				goto out_err32;
! 		}
  	}
! #undef PciDmaMask
! #endif
  	return 0;
! out_err32:
! 	printf_warning("%s: 32 BIT DMA ADDRESSING NOT SUPPORTED\n",
! 		sym_name(np));
! 	return -1;
  }
  /*
--- 1625,1645 ----
   */
  static int sym_setup_bus_dma_mask(struct sym_hcb *np)
  {
! 	struct pci_dev *pdev = &np->s.device;
! 
  	if (np->features & FE_DAC) {
! 		u64 required_mask = dma_set_mask(&pdev->dev, 0xffffffffffULL);
! 		if (required_mask > 0xffffffffUL)
! 			goto use_dac;
  	}
! 
! 	dma_set_mask(&pdev->dev, 0xffffffffUL);
  	return 0;
! use_dac:
! 	np->use_dac = 1;
! 	printf_info("%s: using 64 bit DMA addressing\n", sym_name(np));
! 	return 0;
  }
  /*
There's an obvious problem here -- no error checking on the second
dma_set_mask(). How were we going to indicate "I can't do that"
errors again?
--
"Next the statesmen will invent cheap lies, putting the blame upon
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince
himself that the war is just, and will thank God for the better sleep
he enjoys after this process of grotesque self-deception." -- Mark Twain
On Thu, 2004-06-17 at 10:28, Matthew Wilcox wrote:
> So let's look at the sym2 implementation to see how this would work.
> A straight diff -u is not terribly helpful, So let's try a more verbose
> context diff:
Actually, no. I'm proposing an additional API, not a modification to
dma_set_mask(), so the new code would read:

	struct pci_dev *pdev = &np->s.device;

	if (np->features & FE_DAC) {
		if (!dma_set_mask(&pdev->dev, 0xffffffffffULL))
			if (dma_get_required_mask(&pdev->dev) > 0xffffffffUL)
				goto use_dac;
	}

	dma_set_mask(&pdev->dev, 0xffffffffUL);
	return 0;

use_dac:
	np->use_dac = 1;
	printf_info("%s: using 64 bit DMA addressing\n", sym_name(np));
	return 0;
James
James Bottomley <[email protected]> writes:
> If the driver decides to use the mask, it would do another
> dma_set_mask() to confirm it (this gives the platform the opportunity if
> it so chooses to return a mask that doesn't quite cover memory, but
> would be more optimal...say for platforms that have all memory under 4GB
> bar one small chunk at 64GB or something).
What I think drivers such as AIC7xxx should do is:
#define OUR_COST_32	4
#define OUR_COST_39	8
#define OUR_COST_64	10

	int cost32 = check_dma_mask(32 bits);
	int cost39 = check_dma_mask(39 bits);
	int cost64 = check_dma_mask(64 bits);

	if (!cost32 && !cost39 && !cost64)
		printk(KERN_ERR "64 bits aren't enough for RAM addressing?\n");
	else
		use_mode_with_minimal_cost(cost32 * OUR_COST_32,
					   cost39 * OUR_COST_39,
					   cost64 * OUR_COST_64);
This check_dma_mask() should be renamed + extended to cover different
RAM access types:
- coherent vs non-coherent memory
- preallocated/initialized memory (such as skb->data passed to
hard_start_xmit()) vs uninitialized memory (such as returned by
kmalloc()).
The "cost" is needed for cases where both the host and the device can
support many addressing modes, such as with AIC7xxx, > 4GB of RAM
and (costly) IO MMU or bounce buffers.
Currently, set_dma_mask(less than 32 bits) can return success, but the
mapping functions can then return addresses which don't fit in the
requested number of bits. In fact, set_dma_mask() has meaning only for
the *alloc functions. The statement "pci_set_consistent_dma_mask()
will always be able to set the same or a smaller mask as
pci_set_dma_mask()" doesn't IMHO make sense.
If we fix the API, we should IMHO also remove set_dma_mask() and add
the number of address bits to the arguments of the actual mapping
functions. That would make it possible to use different masks for
different tasks; I'm told there is hardware which can benefit from it.
Done correctly, it wouldn't have any runtime overhead.
I would also change the "u64 mask" into plain number of bits.
It would be easier for people, cpp, gcc and CPU.
--
Krzysztof Halasa, B*FH
On Thu, 2004-06-17 at 19:46, Krzysztof Halasa wrote:
> #define OUR_COST_32	4
> #define OUR_COST_39	8
> #define OUR_COST_64	10
>
> 	int cost32 = check_dma_mask(32 bits);
> 	int cost39 = check_dma_mask(39 bits);
> 	int cost64 = check_dma_mask(64 bits);
>
> 	if (!cost32 && !cost39 && !cost64)
> 		printk(KERN_ERR "64 bits aren't enough for RAM addressing?\n");
> 	else
> 		use_mode_with_minimal_cost(cost32 * OUR_COST_32,
> 					   cost39 * OUR_COST_39,
> 					   cost64 * OUR_COST_64);
>
> This check_dma_mask() should be renamed + extended to cover different
> RAM access types:
> - coherent vs non-coherent memory
> - preallocated/initialized memory (such as skb->data passed to
> hard_start_xmit()) vs uninitialized memory (such as returned by
> kmalloc()).
Well, I did consider a similar API, but not for long.
It falls victim to the 95/5 rule---when you engineer an API, if 95% of
the complexity is dealing with the 5% of special cases, you're
over-engineering.
So the original proposal is the remaining 5% of the complexity that
covers 95% of the use cases (and will do better even on the remaining
5%).
James
On Thu, Jun 17, 2004 at 09:08:51AM -0500, James Bottomley wrote:
> Background:
>
> We have a large number of devices in scsi: aacraid, aic7xxx, qla1280,
> qla2xxx which can all do full 64 bit DMA, but which pay a performance
> penalty for using the larger descriptors (aic7xxx is stranger in that it
> has three modes of operation: 32 bit, 39 bit and 64 bit each with an
> increasing performance penalty).
>
> What all these devices would like to do is instead of simply trying the
> 64 bit mask first and having the platform accept it, even if it only has
> < 4GB of memory, they'd like to be able to have the platform tell them,
> given my current dma mask setting, what's the actual number of bits you
> need me to DMA to.
>
> This is precisely what the API would do. It would return a bit mask
> (never over the current dma_mask) that the platform considers optimal.
> The platform has complete freedom in this: it may return a mask covering
> the total physical memory, or a mask covering all the bits it needs
> setting for some weird numa scheme.
>
> If the driver decides to use the mask, it would do another
> dma_set_mask() to confirm it (this gives the platform the opportunity if
> it so chooses to return a mask that doesn't quite cover memory, but
> would be more optimal...say for platforms that have all memory under 4GB
> bar one small chunk at 64GB or something).
>
> Once the driver has the platform's optimal mask, it can use this to
> decide on the correct descriptor size.
>
> Comments?
>
> James
Sounds good. But I'm curious why you make the driver call dma_set_mask()
twice.
I.e., why not:

	mask = dma_get_required_mask(...);
	if (mask < (1ull << 32)) {
		dma_set_mask(... 32 bit);
		use_dac = 0;
	} else {
		dma_set_mask(... 64 bit);
		use_dac = 1;
	}
Are there cases you're thinking of where the platform's required mask
would vary depending on the current mask?
jeremy
On Fri, Jun 18, 2004 at 02:46:46AM +0200, Krzysztof Halasa wrote:
> If we fix the API we should IMHO also remove set_dma_mask() and add
> the number of address bits to the arguments of actual mapping
> functions.
Good idea, except for the fact that we have drivers merged which have
real masks like 0x0fefffff. It _really is_ a mask and not a number
of bits.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
On Fri, 2004-06-18 at 00:59, Jeremy Higdon wrote:
> Sounds good. But I'm curious why you make the driver call dma_set_mask()
> twice.
Basically, so that dma_get_required_mask() returns a value that may be
related to the current mask. If the device has a weird mask setting
(say it has bits missing or something), the platform may want to return
something different, tuning the required mask to be optimal.
James
Russell King <[email protected]> writes:
> Good idea, except for the fact that we have drivers merged which have
> real masks like 0x0fefffff. It _really is_ a mask and not a number
> of bits.
Which drivers do you mean? And which platforms support it?
--
Krzysztof Halasa, B*FH
James Bottomley <[email protected]> writes:
> It falls victim to the 95/5 rule---when you engineer an API, if 95% of
> the complexity is dealing with the 5% of special cases, you're over
> engineering.
>
> So the original proposal is the remaining 5% that covers 95% of the use
> cases (and will do better even on the remaining 5%).
I don't think so. We already have separate masks for coherent
and non-coherent mappings (in PCI API, and I'm told it's to be extended
to DMA API as well). And we need them.
The problem is that we're missing DMA masks for non-alloc calls
(depending on the platform), and thus the API isn't very reliable.
Drivers which need this are forced to bounce buffers themselves, and
many of them will not work on 64-bit platforms (as of ~2.6.0; I don't
check that regularly). And yes, we really need reliable masks for
non-alloc mappings.
I can't see any added complexity in this scheme; we basically already
do all of that (except the cost, but it's hardly complex, and most
drivers would just need to test check_dma_mask() for error).
--
Krzysztof Halasa, B*FH
On Fri, 2004-06-18 at 18:07, Krzysztof Halasa wrote:
> James Bottomley <[email protected]> writes:
> I don't think so. We already have separate masks for coherent
> and non-coherent mappings (in PCI API, and I'm told it's to be extended
> to DMA API as well). And we need them.
>
> The problem is we're missing DMA masks for non-alloc calls (depending
> on the platform) and thus that it isn't very reliable. Drivers which
> need this are forced to bounce buffers themselves, and many of them
> will not work on 64-bit platforms (as of ~ 2.6.0, I don't check that
> regularly). And yes, we really need reliable masks for non-alloc
> mappings.
Could you elaborate on this? In the current scheme the coherent mask is
for descriptor allocation (i.e. dma_alloc_coherent()) and the dma_mask
represents the bus physical addresses to which the device can DMA
directly; there's not much more the DMA API really does, what do you
think is missing?
James
On Sat, Jun 19, 2004 at 01:10:31AM +0200, Krzysztof Halasa wrote:
> Russell King <[email protected]> writes:
> > Good idea, except for the fact that we have drivers merged which have
> > real masks like 0x0fefffff. It _really is_ a mask and not a number
> > of bits.
>
> Which drivers do you mean? And which platforms support it?
The SA1111 device and associated sub-device drivers. Basically, Intel
has a "no fix" erratum where one of the address bits gets incorrectly
routed to the SDRAM "auto precharge" address bit. This address bit
must be zero, otherwise the SDRAM accesses are messed up.
However, because SDRAM is accessed by "row" and "column" addresses,
the physical address bit which corresponds to the "auto precharge"
signal varies according to the size of SDRAM.
This is what the following table is about:
	static u32 sa1111_dma_mask[] = {
		~0,
		~(1 << 20),
		~(1 << 23),
		~(1 << 24),
		~(1 << 25),
		~(1 << 20),
		~(1 << 20),
		0,
	};
For each setting of the SDRAM row/column address multiplexer, we have
to ensure that a certain DMA address bit is zero. Note, however,
that the problem address bit does not correspond to the highest
addressable bit. You may have a 32MB SDRAM but need to ensure that
physical bit 20 is always zero - IOW you can only DMA from even MB
addresses and not odd MB addresses.
Hence, this ends up with DMA masks such as the example in my previous
mail.
And yes, this chip is popular, in use, and this quirk is not difficult
to cleanly work around given our existing API.
This is why changing from "dma mask" to "number of bits" would break
existing drivers.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
James Bottomley <[email protected]> writes:
> Could you elaborate on this? In the current scheme the coherent mask is
> for descriptor allocation (i.e. dma_alloc_coherent()) and the dma_mask
> represents the bus physical addresses to which the device can DMA
> directly; there's not much more the DMA API really does, what do you
> think is missing?
The problem is that (depending on platform) the pci_map_* and dma_map_*
functions ignore both masks. An example of such a platform is i386 :-)
It seems the masks are used on i386 for only one thing: the consistent
dma mask is used for consistent allocations, and the normal dma mask
is not used at all.
The normal mask is used mainly on 64-bit platforms and the meaningful
values are 2^32-1 and 2^64-1. It's used by PCI-X device drivers to
enable DAC transfers. This is why it isn't used on 32-bit platforms.
--
Krzysztof Halasa, B*FH
Russell King <[email protected]> writes:
> The SA1111 device and associated sub-device drivers. Basically, Intel
> has a "no fix" errata where one of the address bits gets incorrectly
> routed to the SDRAM "auto precharge" address bit. This address bit
> must be zero, otherwise the SDRAM accesses are messed up.
Well, I knew about it, but I thought it was a "host" problem that
device drivers didn't have to care about.
> You may have a 32MB SDRAM but need to ensure that
> physical bit 20 is always zero - IOW you can only DMA from even MB
> addresses and not odd MB addresses.
I understand that currently the bit in question is cleared from the
masks, and that the ARM/SA1111 dma/pci alloc/map functions know how
to handle such a mask?
Wouldn't it be better to not touch the masks (which are device
capabilities rather than platform limitations) and let alloc/map
functions always use the correct half of RAM?
Just asking, I've never worked with such thing.
--
Krzysztof Halasa, B*FH
On Sat, 2004-06-19 at 18:39, Krzysztof Halasa wrote:
> The problem is that (depending on platform) the pci_map_* and dma_map_*
> functions ignore both masks. An example of such platform is i386 :-)
>
> It seems the masks are used on i386 for only one thing - consistent
> dma mask is used for consistent allocations only, and normal dma mask
> is not used at all.
Actually, I think you misunderstand the way the API works. The only
time the dma_map_ functions pay attention to the mask is in an IOMMU
transaction. For no-IOMMU systems, it's far more efficient for bouncing
to occur in the upper layers (as it does for block and net).
> The normal mask is used mainly on 64-bit platforms and the meaningful
> values are 2^32-1 and 2^64-1. It's used by PCI-X device drivers to
> enable DAC transfers. This is why it isn't used on 32-bit platforms.
This statement is incorrect; Russell has already given you a
counterexample.
James
> Wouldn't it be better to not touch the masks (which are device
> capabilities rather than platform limitations) and let alloc/map
> functions always use the correct half of RAM?
They are device limitations - the SA1111 device (which is a bus master
device in its own right) itself is the cause of the problem, so the
devices integrated inside it need to know.
E.g.:

+------------------------+              +---------------------------+
|                        |              |                           |
|  SSP  CPU--bus i/f-----+------+-------+-bus i/f  OHCI  SSP  SAC   |
|   |    |               |      |       |    |      |     |    |    |
|   +----+-------------+ |      |       |    +------+-----+----+    |
|        SA11x0          |  +-------+   |          SA1111           |
+------------------------+  | SDRAM |   +---------------------------+
                            +-------+
The SDRAM bus is shared between the SA11x0 and SA1111 devices, and
there is an arbitration protocol to hand off the SDRAM bus between
the two devices.
The SSP and CPU itself inside the SA11x0 can access all SDRAM memory,
but the OHCI, SSP and SAC devices within the SA1111 are affected by
the problem.
So it's quite correct for this to be a device thing not a platform
thing. It _is_ the SA1111 device itself which has the problem.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core
Russell King <[email protected]> writes:
> So it's quite correct for this to be a device thing not a platform
> thing. It _is_ the SA1111 device itself which has the problem.
Now I see that. Many thanks for the detailed explanation.
--
Krzysztof Halasa, B*FH