2013-08-05 20:42:34

by Chris Metcalf

Subject: [PATCH 00/20] PCI root complex changes for tile architecture

These changes add new features and fix bugs in the Tilera PCI
root complex subsystem. Among the new tilegx features are support
for I/O space access, more MSI-X interrupt vectors, and support for
TRIO 0 MAC 0 on Gx72 systems.

Chris Metcalf (20):
tile PCI RC: cleanups for tilepro PCI RC
tile PCI RC: tilepro conflict with PCI and RAM addresses
tile PCI RC: support pci=off boot arg for tilepro
tile PCI RC: tweak the pcie_rc_delay support
tile PCI RC: handle case that PCI link is already up
tile: support LSI MEGARAID SAS HBA hybrid dma_ops
tile PCI RC: support more MSI-X interrupt vectors
tile PCI RC: gentler warning for missing plug-in PCI
tile PCI RC: support I/O space access
tile PCI DMA: handle a NULL dev argument properly
tile PCI RC: restructure TRIO initialization
tile PCI RC: eliminate pci_controller.mem_resources field
tile PCI RC: include pci/pcie/Kconfig
tile PCI RC: clean up comments and whitespace format
tile PCI RC: use proper accessor function
tile PCI RC: add dma_get_required_mask()
tile PCI DMA: fix bug in non-page-aligned accessors
tile PCI RC: support PCIe TRIO 0 MAC 0 on Gx72 system
tile PCI RC: reduce driver's vmalloc space usage
tile PCI RC: remove stale include of linux/numa.h

arch/tile/Kconfig | 12 +
arch/tile/gxio/iorpc_trio.c | 23 ++
arch/tile/include/arch/trio.h | 39 ++
arch/tile/include/asm/dma-mapping.h | 8 +-
arch/tile/include/asm/io.h | 126 ++++++-
arch/tile/include/asm/pci.h | 17 +-
arch/tile/include/gxio/iorpc_trio.h | 5 +
arch/tile/include/hv/drv_trio_intf.h | 8 +-
arch/tile/kernel/pci-dma.c | 44 ++-
arch/tile/kernel/pci.c | 33 +-
arch/tile/kernel/pci_gx.c | 710 ++++++++++++++++++++---------------
arch/tile/kernel/setup.c | 9 +-
12 files changed, 678 insertions(+), 356 deletions(-)

--
1.8.3.1


2013-08-05 20:42:36

by Chris Metcalf

Subject: [PATCH 01/20] tile PCI RC: cleanups for tilepro PCI RC

- remove unneeded <linux/bootmem.h> include in pci.c
- eliminate unused pci_controller.first_busno field
- prefer msleep() to mdelay(): this code runs in process context, so
  there is no need to busy-wait
- remove stale comment about pci_scan_bus_parented()

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/pci.h | 1 -
arch/tile/kernel/pci.c | 16 +++-------------
2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/tile/include/asm/pci.h b/arch/tile/include/asm/pci.h
index 54a9242..cd10e65 100644
--- a/arch/tile/include/asm/pci.h
+++ b/arch/tile/include/asm/pci.h
@@ -29,7 +29,6 @@ struct pci_controller {
int index; /* PCI domain number */
struct pci_bus *root_bus;

- int first_busno;
int last_busno;

int hv_cfg_fd[2]; /* config{0,1} fds for this PCIe controller */
diff --git a/arch/tile/kernel/pci.c b/arch/tile/kernel/pci.c
index 67237d3..1dae3b2 100644
--- a/arch/tile/kernel/pci.c
+++ b/arch/tile/kernel/pci.c
@@ -20,7 +20,6 @@
#include <linux/capability.h>
#include <linux/sched.h>
#include <linux/errno.h>
-#include <linux/bootmem.h>
#include <linux/irq.h>
#include <linux/io.h>
#include <linux/uaccess.h>
@@ -192,7 +191,6 @@ int __init tile_pci_init(void)
controller->hv_cfg_fd[0] = hv_cfg_fd0;
controller->hv_cfg_fd[1] = hv_cfg_fd1;
controller->hv_mem_fd = hv_mem_fd;
- controller->first_busno = 0;
controller->last_busno = 0xff;
controller->ops = &tile_cfg_ops;

@@ -283,7 +281,7 @@ int __init pcibios_init(void)
* known to require at least 20ms here, but we use a more
* conservative value.
*/
- mdelay(250);
+ msleep(250);

/* Scan all of the recorded PCI controllers. */
for (i = 0; i < TILE_NUM_PCIE; i++) {
@@ -304,18 +302,10 @@ int __init pcibios_init(void)

pr_info("PCI: initializing controller #%d\n", i);

- /*
- * This comes from the generic Linux PCI driver.
- *
- * It reads the PCI tree for this bus into the Linux
- * data structures.
- *
- * This is inlined in linux/pci.h and calls into
- * pci_scan_bus_parented() in probe.c.
- */
pci_add_resource(&resources, &ioport_resource);
pci_add_resource(&resources, &iomem_resource);
- bus = pci_scan_root_bus(NULL, 0, controller->ops, controller, &resources);
+ bus = pci_scan_root_bus(NULL, 0, controller->ops,
+ controller, &resources);
controller->root_bus = bus;
controller->last_busno = bus->busn_res.end;
}
--
1.8.3.1

2013-08-05 20:42:45

by Chris Metcalf

Subject: [PATCH 04/20] tile PCI RC: tweak the pcie_rc_delay support

Allow longer delays if requested (the MAX_RC_DELAY cap is removed),
and print the info message when the delay is actually performed,
not when the boot argument is parsed.
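
For example (values illustrative), booting with

    pcie_rc_delay=0,1,15

delays the bus probe on MAC 1 of TRIO 0 by 15 seconds, while

    pcie_rc_delay=0,1

falls back to DEFAULT_RC_DELAY (10 seconds).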

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/pci_gx.c | 35 ++++++++++++++++-------------------
1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index 1142563..e99809e 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -69,17 +69,14 @@ static int pcie_rc[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];
* a HW PCIe link-training bug. The exact delay is specified with
* a kernel boot argument in the form of "pcie_rc_delay=T,P,S",
* where T is the TRIO instance number, P is the port number and S is
- * the delay in seconds. If the delay is not provided, the value
- * will be DEFAULT_RC_DELAY.
+ * the delay in seconds. If the argument is specified, but the delay is
+ * not provided, the value will be DEFAULT_RC_DELAY.
*/
static int rc_delay[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];

/* Default number of seconds that the PCIe RC port probe can be delayed. */
#define DEFAULT_RC_DELAY 10

-/* Max number of seconds that the PCIe RC port probe can be delayed. */
-#define MAX_RC_DELAY 20
-
/* Array of the PCIe ports configuration info obtained from the BIB. */
struct pcie_port_property pcie_ports[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];

@@ -570,14 +567,9 @@ static int setup_pcie_rc_delay(char *str)
if (!isdigit(*str))
return -EINVAL;
delay = simple_strtoul(str, (char **)&str, 10);
- if (delay > MAX_RC_DELAY)
- return -EINVAL;
}

rc_delay[trio_index][mac] = delay ? : DEFAULT_RC_DELAY;
- pr_info("Delaying PCIe RC link training for %u sec"
- " on MAC %lu on TRIO %lu\n", rc_delay[trio_index][mac],
- mac, trio_index);
return 0;
}
early_param("pcie_rc_delay", setup_pcie_rc_delay);
@@ -682,12 +674,6 @@ int __init pcibios_init(void)
continue;
}

- /*
- * Delay the RC link training if needed.
- */
- if (rc_delay[trio_index][mac])
- msleep(rc_delay[trio_index][mac] * 1000);
-
ret = gxio_trio_force_rc_link_up(trio_context, mac);
if (ret < 0)
pr_err("PCI: PCIE_FORCE_LINK_UP failure, "
@@ -697,10 +683,21 @@ int __init pcibios_init(void)
trio_index, controller->mac);

/*
- * Wait a bit here because some EP devices take longer
- * to come up.
+ * Delay the bus probe if needed.
*/
- msleep(1000);
+ if (rc_delay[trio_index][mac]) {
+ pr_info("Delaying PCIe RC bus enumerating %d sec"
+ " on MAC %d on TRIO %d\n",
+ rc_delay[trio_index][mac], mac,
+ trio_index);
+ msleep(rc_delay[trio_index][mac] * 1000);
+ } else {
+ /*
+ * Wait a bit here because some EP devices
+ * take longer to come up.
+ */
+ msleep(1000);
+ }

/*
* Check for PCIe link-up status.
--
1.8.3.1

2013-08-05 20:42:56

by Chris Metcalf

Subject: [PATCH 08/20] tile PCI RC: gentler warning for missing plug-in PCI

For a plug-in slot, print the link-down status with pr_info() instead
of pr_err(), and add an extra note that a down link is expected if the
slot is empty.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/hv/drv_trio_intf.h | 5 +++--
arch/tile/kernel/pci_gx.c | 10 ++++++++--
2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/tile/include/hv/drv_trio_intf.h b/arch/tile/include/hv/drv_trio_intf.h
index ef9f3f5..ec643a0 100644
--- a/arch/tile/include/hv/drv_trio_intf.h
+++ b/arch/tile/include/hv/drv_trio_intf.h
@@ -64,8 +64,9 @@ struct pcie_port_property
* will not consider it an error if the link comes up as a x8 link. */
uint8_t allow_x8: 1;

- /** Reserved. */
- uint8_t reserved: 1;
+ /** If true, this link is connected to a device which may or may not
+ * be present. */
+ uint8_t removable: 1;

};

diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index e0d6664..bf8c69d 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -729,8 +729,14 @@ int __init pcibios_init(void)
__gxio_mmio_read(trio_context->mmio_base_mac +
reg_offset);
if (!port_status.dl_up) {
- pr_err("PCI: link is down, MAC %d on TRIO %d\n",
- mac, trio_index);
+ if (pcie_ports[trio_index][mac].removable) {
+ pr_info("PCI: link is down, MAC %d on TRIO %d\n",
+ mac, trio_index);
+ pr_info("This is expected if no PCIe card"
+ " is connected to this link\n");
+ } else
+ pr_err("PCI: link is down, MAC %d on TRIO %d\n",
+ mac, trio_index);
continue;
}

--
1.8.3.1

2013-08-05 20:42:39

by Chris Metcalf

Subject: [PATCH 02/20] tile PCI RC: tilepro conflict with PCI and RAM addresses

Fix a bug in the tilepro PCI resource allocation code that could make
the bootmem allocator unhappy if 4GB is installed on mshim 0: the code
reserved the entire PCI hole on any node whose range merely overlapped
it. Clamp the reservation to the part of the hole that actually falls
within the node's own range.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/setup.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/tile/kernel/setup.c b/arch/tile/kernel/setup.c
index 68b5426..676e155 100644
--- a/arch/tile/kernel/setup.c
+++ b/arch/tile/kernel/setup.c
@@ -614,11 +614,12 @@ static void __init setup_bootmem_allocator_node(int i)
/*
* Throw away any memory aliased by the PCI region.
*/
- if (pci_reserve_start_pfn < end && pci_reserve_end_pfn > start)
- reserve_bootmem(PFN_PHYS(pci_reserve_start_pfn),
- PFN_PHYS(pci_reserve_end_pfn -
- pci_reserve_start_pfn),
+ if (pci_reserve_start_pfn < end && pci_reserve_end_pfn > start) {
+ start = max(pci_reserve_start_pfn, start);
+ end = min(pci_reserve_end_pfn, end);
+ reserve_bootmem(PFN_PHYS(start), PFN_PHYS(end - start),
BOOTMEM_EXCLUSIVE);
+ }
#endif
}

--
1.8.3.1

2013-08-05 20:42:48

by Chris Metcalf

Subject: [PATCH 06/20] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

The LSI MEGARAID SAS HBA can do 64-bit DMA to streaming buffers but
not to consistent buffers. In other words, 64-bit DMA is used for disk
data transfers, while 32-bit DMA must be used for control message
transfers. According to LSI, the firmware is not fully functional yet.
This change implements a kind of hybrid dma_ops to support this.
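
A condensed sketch of the idea, paraphrasing the dma_set_coherent_mask()
hunk below (not new logic): when such a device lowers its coherent mask
to 32 bits, only the coherent alloc/free hooks are redirected to the
swiotlb-backed routines, while the streaming map/unmap hooks keep the
direct 64-bit path.

    /* Sketch: hybridize the 64-bit ops for a <=32-bit coherent mask. */
    if (dma_ops == gx_pci_dma_map_ops && mask <= DMA_BIT_MASK(32)) {
            dma_ops->alloc = tile_swiotlb_alloc_coherent; /* 32-bit pool */
            dma_ops->free = tile_swiotlb_free_coherent;
            /* .map_page/.map_sg are untouched: still 64-bit streaming. */
    }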

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/dma-mapping.h | 4 ++--
arch/tile/kernel/pci-dma.c | 17 +++++++++++------
2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
index f2ff191..6da0540 100644
--- a/arch/tile/include/asm/dma-mapping.h
+++ b/arch/tile/include/asm/dma-mapping.h
@@ -44,12 +44,12 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)

static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
{
- return paddr + get_dma_offset(dev);
+ return paddr;
}

static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
{
- return daddr - get_dma_offset(dev);
+ return daddr;
}

static inline void dma_mark_clean(void *addr, size_t size) {}
diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
index b9fe80e..7e98371 100644
--- a/arch/tile/kernel/pci-dma.c
+++ b/arch/tile/kernel/pci-dma.c
@@ -357,7 +357,7 @@ static void *tile_pci_dma_alloc_coherent(struct device *dev, size_t size,

addr = page_to_phys(pg);

- *dma_handle = phys_to_dma(dev, addr);
+ *dma_handle = addr + get_dma_offset(dev);

return page_address(pg);
}
@@ -387,7 +387,7 @@ static int tile_pci_dma_map_sg(struct device *dev, struct scatterlist *sglist,
sg->dma_address = sg_phys(sg);
__dma_prep_pa_range(sg->dma_address, sg->length, direction);

- sg->dma_address = phys_to_dma(dev, sg->dma_address);
+ sg->dma_address = sg->dma_address + get_dma_offset(dev);
#ifdef CONFIG_NEED_SG_DMA_LENGTH
sg->dma_length = sg->length;
#endif
@@ -422,7 +422,7 @@ static dma_addr_t tile_pci_dma_map_page(struct device *dev, struct page *page,
BUG_ON(offset + size > PAGE_SIZE);
__dma_prep_page(page, offset, size, direction);

- return phys_to_dma(dev, page_to_pa(page) + offset);
+ return page_to_pa(page) + offset + get_dma_offset(dev);
}

static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
@@ -432,7 +432,7 @@ static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
{
BUG_ON(!valid_dma_direction(direction));

- dma_address = dma_to_phys(dev, dma_address);
+ dma_address -= get_dma_offset(dev);

__dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
dma_address & PAGE_OFFSET, size, direction);
@@ -445,7 +445,7 @@ static void tile_pci_dma_sync_single_for_cpu(struct device *dev,
{
BUG_ON(!valid_dma_direction(direction));

- dma_handle = dma_to_phys(dev, dma_handle);
+ dma_handle -= get_dma_offset(dev);

__dma_complete_pa_range(dma_handle, size, direction);
}
@@ -456,7 +456,7 @@ static void tile_pci_dma_sync_single_for_device(struct device *dev,
enum dma_data_direction
direction)
{
- dma_handle = dma_to_phys(dev, dma_handle);
+ dma_handle -= get_dma_offset(dev);

__dma_prep_pa_range(dma_handle, size, direction);
}
@@ -573,6 +573,11 @@ int dma_set_coherent_mask(struct device *dev, u64 mask)
if (((dma_ops == gx_pci_dma_map_ops) ||
(dma_ops == gx_legacy_pci_dma_map_ops)) &&
(mask <= DMA_BIT_MASK(32))) {
+ if (dma_ops == gx_pci_dma_map_ops) {
+ dma_ops->alloc = tile_swiotlb_alloc_coherent;
+ dma_ops->free = tile_swiotlb_free_coherent;
+ }
+
if (mask > dev->archdata.max_direct_dma_addr)
mask = dev->archdata.max_direct_dma_addr;
}
--
1.8.3.1

2013-08-05 20:43:09

by Chris Metcalf

Subject: [PATCH 09/20] tile PCI RC: support I/O space access

To enable this functionality, configure CONFIG_TILE_PCI_IO. Without
this flag, the kernel still assigns I/O address ranges to the devices,
but no TRIO resources or mappings are set up to back them.

We assign disjoint I/O address ranges to separate PCIe domains.
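
Concretely, per the pci_gx.c hunk below (a sketch; IO_SPACE_SIZE is
0x10000):

    /*
     * Each PCI domain i owns a disjoint 64KB window of I/O space:
     *   domain 0: [PCIBIOS_MIN_IO, PCIBIOS_MIN_IO + 0xffff]
     *   domain 1: [PCIBIOS_MIN_IO + 0x10000, PCIBIOS_MIN_IO + 0x1ffff]
     *   ...
     * PCIBIOS_MIN_IO is now PAGE_SIZE, so address 0 is never assigned.
     */
    controller->io_space.start = PCIBIOS_MIN_IO + (i * IO_SPACE_SIZE);
    controller->io_space.end = controller->io_space.start +
                               IO_SPACE_SIZE - 1;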

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/Kconfig | 10 ++++
arch/tile/include/asm/io.h | 126 +++++++++++++++++++++++++++++++++++++++----
arch/tile/include/asm/pci.h | 11 +++-
arch/tile/kernel/pci_gx.c | 128 +++++++++++++++++++++++++++++++++++++++++---
4 files changed, 257 insertions(+), 18 deletions(-)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index 24565a7..bfff769 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -396,6 +396,16 @@ config NO_IOMEM
config NO_IOPORT
def_bool !PCI

+config TILE_PCI_IO
+ bool "PCI I/O space support"
+ default n
+ depends on PCI
+ depends on TILEGX
+ ---help---
+ Enable PCI I/O space support on TILEGx. Since the PCI I/O space
+ is used by few modern PCIe endpoint devices, its support is disabled
+ by default to save the TRIO PIO Region resource for other purposes.
+
source "drivers/pci/Kconfig"

config TILE_USB
diff --git a/arch/tile/include/asm/io.h b/arch/tile/include/asm/io.h
index 3167291..73b319e 100644
--- a/arch/tile/include/asm/io.h
+++ b/arch/tile/include/asm/io.h
@@ -19,7 +19,8 @@
#include <linux/bug.h>
#include <asm/page.h>

-#define IO_SPACE_LIMIT 0xfffffffful
+/* Maximum PCI I/O space address supported. */
+#define IO_SPACE_LIMIT 0xffffffff

/*
* Convert a physical pointer to a virtual kernel pointer for /dev/mem
@@ -281,8 +282,108 @@ static inline void memcpy_toio(volatile void __iomem *dst, const void *src,

#endif

+#if CHIP_HAS_MMIO() && defined(CONFIG_TILE_PCI_IO)
+
+static inline u8 inb(unsigned long addr)
+{
+ return readb((volatile void __iomem *) addr);
+}
+
+static inline u16 inw(unsigned long addr)
+{
+ return readw((volatile void __iomem *) addr);
+}
+
+static inline u32 inl(unsigned long addr)
+{
+ return readl((volatile void __iomem *) addr);
+}
+
+static inline void outb(u8 b, unsigned long addr)
+{
+ writeb(b, (volatile void __iomem *) addr);
+}
+
+static inline void outw(u16 b, unsigned long addr)
+{
+ writew(b, (volatile void __iomem *) addr);
+}
+
+static inline void outl(u32 b, unsigned long addr)
+{
+ writel(b, (volatile void __iomem *) addr);
+}
+
+static inline void insb(unsigned long addr, void *buffer, int count)
+{
+ if (count) {
+ u8 *buf = buffer;
+ do {
+ u8 x = inb(addr);
+ *buf++ = x;
+ } while (--count);
+ }
+}
+
+static inline void insw(unsigned long addr, void *buffer, int count)
+{
+ if (count) {
+ u16 *buf = buffer;
+ do {
+ u16 x = inw(addr);
+ *buf++ = x;
+ } while (--count);
+ }
+}
+
+static inline void insl(unsigned long addr, void *buffer, int count)
+{
+ if (count) {
+ u32 *buf = buffer;
+ do {
+ u32 x = inl(addr);
+ *buf++ = x;
+ } while (--count);
+ }
+}
+
+static inline void outsb(unsigned long addr, const void *buffer, int count)
+{
+ if (count) {
+ const u8 *buf = buffer;
+ do {
+ outb(*buf++, addr);
+ } while (--count);
+ }
+}
+
+static inline void outsw(unsigned long addr, const void *buffer, int count)
+{
+ if (count) {
+ const u16 *buf = buffer;
+ do {
+ outw(*buf++, addr);
+ } while (--count);
+ }
+}
+
+static inline void outsl(unsigned long addr, const void *buffer, int count)
+{
+ if (count) {
+ const u32 *buf = buffer;
+ do {
+ outl(*buf++, addr);
+ } while (--count);
+ }
+}
+
+extern void __iomem *ioport_map(unsigned long port, unsigned int len);
+extern void ioport_unmap(void __iomem *addr);
+
+#else
+
/*
- * The Tile architecture does not support IOPORT, even with PCI.
+ * The TilePro architecture does not support IOPORT, even with PCI.
* Unfortunately we can't yet simply not declare these methods,
* since some generic code that compiles into the kernel, but
* we never run, uses them unconditionally.
@@ -290,7 +391,12 @@ static inline void memcpy_toio(volatile void __iomem *dst, const void *src,

static inline long ioport_panic(void)
{
+#ifdef __tilegx__
+ panic("PCI IO space support is disabled. Configure the kernel with"
+ " CONFIG_TILE_PCI_IO to enable it");
+#else
panic("inb/outb and friends do not exist on tile");
+#endif
return 0;
}

@@ -335,13 +441,6 @@ static inline void outl(u32 b, unsigned long addr)
ioport_panic();
}

-#define inb_p(addr) inb(addr)
-#define inw_p(addr) inw(addr)
-#define inl_p(addr) inl(addr)
-#define outb_p(x, addr) outb((x), (addr))
-#define outw_p(x, addr) outw((x), (addr))
-#define outl_p(x, addr) outl((x), (addr))
-
static inline void insb(unsigned long addr, void *buffer, int count)
{
ioport_panic();
@@ -372,6 +471,15 @@ static inline void outsl(unsigned long addr, const void *buffer, int count)
ioport_panic();
}

+#endif /* CHIP_HAS_MMIO() && defined(CONFIG_TILE_PCI_IO) */
+
+#define inb_p(addr) inb(addr)
+#define inw_p(addr) inw(addr)
+#define inl_p(addr) inl(addr)
+#define outb_p(x, addr) outb((x), (addr))
+#define outw_p(x, addr) outw((x), (addr))
+#define outl_p(x, addr) outl((x), (addr))
+
#define ioread16be(addr) be16_to_cpu(ioread16(addr))
#define ioread32be(addr) be32_to_cpu(ioread32(addr))
#define iowrite16be(v, addr) iowrite16(be16_to_cpu(v), (addr))
diff --git a/arch/tile/include/asm/pci.h b/arch/tile/include/asm/pci.h
index cd10e65..9cf5308 100644
--- a/arch/tile/include/asm/pci.h
+++ b/arch/tile/include/asm/pci.h
@@ -144,6 +144,10 @@ struct pci_controller {

int pio_mem_index; /* PIO region index for memory access */

+#ifdef CONFIG_TILE_PCI_IO
+ int pio_io_index; /* PIO region index for I/O space access */
+#endif
+
/*
* Mem-Map regions for all the memory controllers so that Linux can
* map all of its physical memory space to the PCI bus.
@@ -153,6 +157,10 @@ struct pci_controller {
int index; /* PCI domain number */
struct pci_bus *root_bus;

+ /* PCI I/O space resource for this controller. */
+ struct resource io_space;
+ char io_space_name[32];
+
/* PCI memory space resource for this controller. */
struct resource mem_space;
char mem_space_name[32];
@@ -210,7 +218,8 @@ static inline int pcibios_assign_all_busses(void)
}

#define PCIBIOS_MIN_MEM 0
-#define PCIBIOS_MIN_IO 0
+/* Minimum PCI I/O address, starting at the page boundary. */
+#define PCIBIOS_MIN_IO PAGE_SIZE

/* Use any cpu for PCI. */
#define cpumask_of_pcibus(bus) cpu_online_mask
diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index bf8c69d..fde3589 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -77,6 +77,9 @@ static int rc_delay[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];
/* Default number of seconds that the PCIe RC port probe can be delayed. */
#define DEFAULT_RC_DELAY 10

+/* The PCI I/O space size in each PCI domain. */
+#define IO_SPACE_SIZE 0x10000
+
/* Array of the PCIe ports configuration info obtained from the BIB. */
struct pcie_port_property pcie_ports[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];

@@ -421,6 +424,17 @@ out:
controller->index = i;
controller->ops = &tile_cfg_ops;

+ controller->io_space.start = PCIBIOS_MIN_IO +
+ (i * IO_SPACE_SIZE);
+ controller->io_space.end = controller->io_space.start +
+ IO_SPACE_SIZE - 1;
+ BUG_ON(controller->io_space.end > IO_SPACE_LIMIT);
+ controller->io_space.flags = IORESOURCE_IO;
+ snprintf(controller->io_space_name,
+ sizeof(controller->io_space_name),
+ "PCI I/O domain %d", i);
+ controller->io_space.name = controller->io_space_name;
+
/*
* The PCI memory resource is located above the PA space.
* For every host bridge, the BAR window or the MMIO aperture
@@ -861,17 +875,16 @@ int __init pcibios_init(void)
/*
* The PCI memory resource is located above the PA space.
* The memory range for the PCI root bus should not overlap
- * with the physical RAM
+ * with the physical RAM.
*/
pci_add_resource_offset(&resources, &controller->mem_space,
controller->mem_offset);
-
+ pci_add_resource(&resources, &controller->io_space);
controller->first_busno = next_busno;
bus = pci_scan_root_bus(NULL, next_busno, controller->ops,
controller, &resources);
controller->root_bus = bus;
next_busno = bus->busn_res.end + 1;
-
}

/* Do machine dependent PCI interrupt routing */
@@ -915,9 +928,9 @@ int __init pcibios_init(void)
pci_controllers[i].mem_resources[0] =
*next_bus->resource[0];
pci_controllers[i].mem_resources[1] =
- *next_bus->resource[1];
+ *next_bus->resource[1];
pci_controllers[i].mem_resources[2] =
- *next_bus->resource[2];
+ *next_bus->resource[2];

break;
}
@@ -967,6 +980,39 @@ int __init pcibios_init(void)
continue;
}

+#ifdef CONFIG_TILE_PCI_IO
+ /*
+ * Alloc a PIO region for PCI I/O space access for each RC port.
+ */
+ ret = gxio_trio_alloc_pio_regions(trio_context, 1, 0, 0);
+ if (ret < 0) {
+ pr_err("PCI: I/O PIO alloc failure on TRIO %d mac %d, "
+ "give up\n", controller->trio_index,
+ controller->mac);
+
+ continue;
+ }
+
+ controller->pio_io_index = ret;
+
+ /*
+ * For PIO IO, the bus_address_hi parameter is hard-coded 0
+ * because PCI I/O address space is 32-bit.
+ */
+ ret = gxio_trio_init_pio_region_aux(trio_context,
+ controller->pio_io_index,
+ controller->mac,
+ 0,
+ HV_TRIO_PIO_FLAG_IO_SPACE);
+ if (ret < 0) {
+ pr_err("PCI: I/O PIO init failure on TRIO %d mac %d, "
+ "give up\n", controller->trio_index,
+ controller->mac);
+
+ continue;
+ }
+#endif
+
/*
* Configure a Mem-Map region for each memory controller so
* that Linux can map all of its PA space to the PCI bus.
@@ -1052,8 +1098,7 @@ char *pcibios_setup(char *str)

/*
* Enable memory address decoding, as appropriate, for the
- * device described by the 'dev' struct. The I/O decoding
- * is disabled, though the TILE-Gx supports I/O addressing.
+ * device described by the 'dev' struct.
*
* This is called from the generic PCI layer, and can be called
* for bridges or endpoints.
@@ -1134,10 +1179,77 @@ got_it:
* We need to keep the PCI bus address's in-page offset in the VA.
*/
return iorpc_ioremap(trio_fd, offset, size) +
- (phys_addr & (PAGE_SIZE - 1));
+ (start & (PAGE_SIZE - 1));
}
EXPORT_SYMBOL(ioremap);

+#ifdef CONFIG_TILE_PCI_IO
+/* Map a PCI I/O address into VA space. */
+void __iomem *ioport_map(unsigned long port, unsigned int size)
+{
+ struct pci_controller *controller = NULL;
+ resource_size_t bar_start;
+ resource_size_t bar_end;
+ resource_size_t offset;
+ resource_size_t start;
+ resource_size_t end;
+ int trio_fd;
+ int i;
+
+ start = port;
+ end = port + size - 1;
+
+ /*
+ * In the following, each PCI controller's mem_resources[0]
+ * represents its PCI I/O resource. By searching port in each
+ * controller's mem_resources[0], we can determine the controller
+ * that should accept the PCI I/O access.
+ */
+
+ for (i = 0; i < num_rc_controllers; i++) {
+ /*
+ * Skip controllers that are not properly initialized or
+ * have down links.
+ */
+ if (pci_controllers[i].root_bus == NULL)
+ continue;
+
+ bar_start = pci_controllers[i].mem_resources[0].start;
+ bar_end = pci_controllers[i].mem_resources[0].end;
+
+ if ((start >= bar_start) && (end <= bar_end)) {
+
+ controller = &pci_controllers[i];
+
+ goto got_it;
+ }
+ }
+
+ if (controller == NULL)
+ return NULL;
+
+got_it:
+ trio_fd = controller->trio->fd;
+
+ /* Convert the resource start to the bus address offset. */
+ port -= controller->io_space.start;
+
+ offset = HV_TRIO_PIO_OFFSET(controller->pio_io_index) + port;
+
+ /*
+ * We need to keep the PCI bus address's in-page offset in the VA.
+ */
+ return iorpc_ioremap(trio_fd, offset, size) + (port & (PAGE_SIZE - 1));
+}
+EXPORT_SYMBOL(ioport_map);
+
+void ioport_unmap(void __iomem *addr)
+{
+ iounmap(addr);
+}
+EXPORT_SYMBOL(ioport_unmap);
+#endif
+
void pci_iounmap(struct pci_dev *dev, void __iomem *addr)
{
iounmap(addr);
--
1.8.3.1

2013-08-05 20:43:15

by Chris Metcalf

Subject: [PATCH 16/20] tile PCI RC: add dma_get_required_mask()

The standard kernel function dma_get_required_mask() uses the
highest DRAM address to determine whether 32-bit or 64-bit DMA
addressing is needed. This only works on architectures that have a
direct mapping between the PA and the PCI address space, i.e. those
that either have no I/O TLB or have one but choose direct mapping.
Neither is true for tilegx. Whether to use 64-bit DMA should depend
only on the PCI device's capability, not on the amount of DRAM
installed, so we now advertise a 64-bit DMA mask unconditionally.
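
For context, this is the standard way a driver consumes the hint (a
generic sketch, not code from this series; "use_dac" is an illustrative
flag):

    /*
     * Only pay for 64-bit descriptors when the platform says they
     * could ever be needed.  After this patch, tilegx always reports
     * a 64-bit required mask, so capable devices take the first
     * branch regardless of installed DRAM.
     */
    if (dma_get_required_mask(&pdev->dev) > DMA_BIT_MASK(32) &&
        !dma_set_mask(&pdev->dev, DMA_BIT_MASK(64)))
            use_dac = 1;        /* 64-bit DMA */
    else if (!dma_set_mask(&pdev->dev, DMA_BIT_MASK(32)))
            use_dac = 0;        /* 32-bit DMA */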

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/dma-mapping.h | 4 ++++
arch/tile/kernel/pci-dma.c | 18 ++++++++++++++++++
2 files changed, 22 insertions(+)

diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
index 6da0540..bd8387d 100644
--- a/arch/tile/include/asm/dma-mapping.h
+++ b/arch/tile/include/asm/dma-mapping.h
@@ -20,6 +20,10 @@
#include <linux/cache.h>
#include <linux/io.h>

+#ifdef __tilegx__
+#define ARCH_HAS_DMA_GET_REQUIRED_MASK
+#endif
+
extern struct dma_map_ops *tile_dma_map_ops;
extern struct dma_map_ops *gx_pci_dma_map_ops;
extern struct dma_map_ops *gx_legacy_pci_dma_map_ops;
diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
index d16cfc1..3570ff7 100644
--- a/arch/tile/kernel/pci-dma.c
+++ b/arch/tile/kernel/pci-dma.c
@@ -590,3 +590,21 @@ int dma_set_coherent_mask(struct device *dev, u64 mask)
}
EXPORT_SYMBOL(dma_set_coherent_mask);
#endif
+
+#ifdef ARCH_HAS_DMA_GET_REQUIRED_MASK
+/*
+ * The generic dma_get_required_mask() uses the highest physical address
+ * (max_pfn) to provide the hint to the PCI drivers regarding 32-bit or
+ * 64-bit DMA configuration. Since TILEGx has I/O TLB/MMU, allowing the
+ * DMAs to use the full 64-bit PCI address space and not limited by
+ * the physical memory space, we always let the PCI devices use
+ * 64-bit DMA if they have that capability, by returning the 64-bit
+ * DMA mask here. The device driver has the option to use 32-bit DMA if
+ * the device is not capable of 64-bit DMA.
+ */
+u64 dma_get_required_mask(struct device *dev)
+{
+ return DMA_BIT_MASK(64);
+}
+EXPORT_SYMBOL_GPL(dma_get_required_mask);
+#endif
--
1.8.3.1

2013-08-05 20:43:24

by Chris Metcalf

Subject: [PATCH 20/20] tile PCI RC: remove stale include of linux/numa.h

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/pci.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/tile/include/asm/pci.h b/arch/tile/include/asm/pci.h
index 2c001b2..c99ad44 100644
--- a/arch/tile/include/asm/pci.h
+++ b/arch/tile/include/asm/pci.h
@@ -17,7 +17,6 @@

#include <linux/dma-mapping.h>
#include <linux/pci.h>
-#include <linux/numa.h>
#include <asm-generic/pci_iomap.h>

#ifndef __tilegx__
--
1.8.3.1

2013-08-05 20:43:11

by Chris Metcalf

Subject: [PATCH 12/20] tile PCI RC: eliminate pci_controller.mem_resources field

The .mem_resources[] field in the pci_controller struct
is now obsoleted by the .mem_space and .io_space fields.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/pci.h | 3 --
arch/tile/kernel/pci_gx.c | 71 ++++++++-------------------------------------
2 files changed, 12 insertions(+), 62 deletions(-)

diff --git a/arch/tile/include/asm/pci.h b/arch/tile/include/asm/pci.h
index 1f1b654..2c001b2 100644
--- a/arch/tile/include/asm/pci.h
+++ b/arch/tile/include/asm/pci.h
@@ -173,9 +173,6 @@ struct pci_controller {

/* Table that maps the INTx numbers to Linux irq numbers. */
int irq_intx_table[4];
-
- /* Address ranges that are routed to this controller/bridge. */
- struct resource mem_resources[3];
};

extern struct pci_controller pci_controllers[TILEGX_NUM_TRIO * TILEGX_TRIO_PCIES];
diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index 8051638..6837be2 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -929,9 +929,6 @@ int __init pcibios_init(void)
struct pci_controller *controller = &pci_controllers[i];
gxio_trio_context_t *trio_context = controller->trio;
struct pci_bus *root_bus = pci_controllers[i].root_bus;
- struct pci_bus *next_bus;
- uint32_t bus_address_hi;
- struct pci_dev *dev;
int ret;
int j;

@@ -945,35 +942,6 @@ int __init pcibios_init(void)
/* Configure the max_payload_size values for this domain. */
fixup_read_and_payload_sizes(controller);

- list_for_each_entry(dev, &root_bus->devices, bus_list) {
- /* Find the PCI host controller, ie. the 1st bridge. */
- if ((dev->class >> 8) == PCI_CLASS_BRIDGE_PCI &&
- (PCI_SLOT(dev->devfn) == 0)) {
- next_bus = dev->subordinate;
- pci_controllers[i].mem_resources[0] =
- *next_bus->resource[0];
- pci_controllers[i].mem_resources[1] =
- *next_bus->resource[1];
- pci_controllers[i].mem_resources[2] =
- *next_bus->resource[2];
-
- break;
- }
- }
-
- if (pci_controllers[i].mem_resources[1].flags & IORESOURCE_MEM)
- bus_address_hi =
- pci_controllers[i].mem_resources[1].start >> 32;
- else if (pci_controllers[i].mem_resources[2].flags & IORESOURCE_PREFETCH)
- bus_address_hi =
- pci_controllers[i].mem_resources[2].start >> 32;
- else {
- /* This is unlikely. */
- pr_err("PCI: no memory resources on TRIO %d mac %d\n",
- controller->trio_index, controller->mac);
- continue;
- }
-
/*
* Alloc a PIO region for PCI memory access for each RC port.
*/
@@ -1153,16 +1121,13 @@ void __iomem *ioremap(resource_size_t phys_addr, unsigned long size)
resource_size_t start;
resource_size_t end;
int trio_fd;
- int i, j;
+ int i;

start = phys_addr;
end = phys_addr + size - 1;

/*
- * In the following, each PCI controller's mem_resources[1]
- * represents its (non-prefetchable) PCI memory resource and
- * mem_resources[2] refers to its prefetchable PCI memory resource.
- * By searching phys_addr in each controller's mem_resources[], we can
+ * By searching phys_addr in each controller's mem_space, we can
* determine the controller that should accept the PCI memory access.
*/

@@ -1174,25 +1139,18 @@ void __iomem *ioremap(resource_size_t phys_addr, unsigned long size)
if (pci_controllers[i].root_bus == NULL)
continue;

- for (j = 1; j < 3; j++) {
- bar_start =
- pci_controllers[i].mem_resources[j].start;
- bar_end =
- pci_controllers[i].mem_resources[j].end;
-
- if ((start >= bar_start) && (end <= bar_end)) {
+ bar_start = pci_controllers[i].mem_space.start;
+ bar_end = pci_controllers[i].mem_space.end;

- controller = &pci_controllers[i];
-
- goto got_it;
- }
+ if ((start >= bar_start) && (end <= bar_end)) {
+ controller = &pci_controllers[i];
+ break;
}
}

if (controller == NULL)
return NULL;

-got_it:
trio_fd = controller->trio->fd;

/* Convert the resource start to the bus address offset. */
@@ -1225,10 +1183,8 @@ void __iomem *ioport_map(unsigned long port, unsigned int size)
end = port + size - 1;

/*
- * In the following, each PCI controller's mem_resources[0]
- * represents its PCI I/O resource. By searching port in each
- * controller's mem_resources[0], we can determine the controller
- * that should accept the PCI I/O access.
+ * By searching the port in each controller's io_space, we can
+ * determine the controller that should accept the PCI I/O access.
*/

for (i = 0; i < num_rc_controllers; i++) {
@@ -1239,21 +1195,18 @@ void __iomem *ioport_map(unsigned long port, unsigned int size)
if (pci_controllers[i].root_bus == NULL)
continue;

- bar_start = pci_controllers[i].mem_resources[0].start;
- bar_end = pci_controllers[i].mem_resources[0].end;
+ bar_start = pci_controllers[i].io_space.start;
+ bar_end = pci_controllers[i].io_space.end;

if ((start >= bar_start) && (end <= bar_end)) {
-
controller = &pci_controllers[i];
-
- goto got_it;
+ break;
}
}

if (controller == NULL)
return NULL;

-got_it:
trio_fd = controller->trio->fd;

/* Convert the resource start to the bus address offset. */
--
1.8.3.1

2013-08-05 20:43:21

by Chris Metcalf

Subject: [PATCH 18/20] tile PCI RC: support PCIe TRIO 0 MAC 0 on Gx72 system

On Tilera Gx72 systems, the logic for figuring out whether
a given port is root complex is slightly different.
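
Condensing the rules spelled out in the comment below into the new
predicate (a paraphrase of the pci_gx.c hunk, not new logic):

    /*
     * Gx72: RC if the BIB allows RC and either forbids EP or the
     * strapping pins select RC.  Otherwise (Gx36): RC only if the
     * BIB allows RC and the strapping pins select RC.
     */
    if (pcie_ports[i].is_gx72 && pcie_ports[i].ports[j].allow_rc)
            is_rc = !pcie_ports[i].ports[j].allow_ep ||
                    strapped_for_rc(context, j);
    else
            is_rc = pcie_ports[i].ports[j].allow_rc &&
                    strapped_for_rc(context, j);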

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/hv/drv_trio_intf.h | 3 +++
arch/tile/kernel/pci_gx.c | 33 ++++++++++++++++++++++++++++++---
2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/tile/include/hv/drv_trio_intf.h b/arch/tile/include/hv/drv_trio_intf.h
index ec643a0..237e04d 100644
--- a/arch/tile/include/hv/drv_trio_intf.h
+++ b/arch/tile/include/hv/drv_trio_intf.h
@@ -168,6 +168,9 @@ pcie_stream_intr_config_sel_t;
struct pcie_trio_ports_property
{
struct pcie_port_property ports[TILEGX_TRIO_PCIES];
+
+ /** Set if this TRIO belongs to a Gx72 device. */
+ uint8_t is_gx72;
};

/* Flags indicating traffic class. */
diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index de5008b..f2bf200 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -436,9 +436,26 @@ int __init tile_pci_init(void)

/*
* Now determine which PCIe ports are configured to operate in RC
- * mode. To use a port, it must be allowed to be in RC mode by the
+ * mode. There is a difference in the port configuration capability
+ * between the Gx36 and Gx72 devices.
+ *
+ * The Gx36 has configuration capability for each of the 3 PCIe
+ * interfaces (disable, auto endpoint, auto RC, etc.).
+ * On the Gx72, you can only select one of the 3 PCIe interfaces per
+ * TRIO to train automatically. Further, the allowable training modes
+ * are reduced to four options (auto endpoint, auto RC, stream x1,
+ * stream x4).
+ *
+ * For Gx36 ports, it must be allowed to be in RC mode by the
* Board Information Block, and the hardware strapping pins must be
* set to RC mode.
+ *
+ * For Gx72 ports, the port will operate in RC mode if either of the
+ * following is true:
+ * 1. It is allowed to be in RC mode by the Board Information Block,
+ * and the BIB doesn't allow the EP mode.
+ * 2. It is allowed to be in either the RC or the EP mode by the BIB,
+ * and the hardware strapping pin is set to RC mode.
*/
for (i = 0; i < TILEGX_NUM_TRIO; i++) {
gxio_trio_context_t *context = &trio_contexts[i];
@@ -447,8 +464,18 @@ int __init tile_pci_init(void)
continue;

for (j = 0; j < TILEGX_TRIO_PCIES; j++) {
- if (pcie_ports[i].ports[j].allow_rc &&
- strapped_for_rc(context, j)) {
+ int is_rc = 0;
+
+ if (pcie_ports[i].is_gx72 &&
+ pcie_ports[i].ports[j].allow_rc) {
+ if (!pcie_ports[i].ports[j].allow_ep ||
+ strapped_for_rc(context, j))
+ is_rc = 1;
+ } else if (pcie_ports[i].ports[j].allow_rc &&
+ strapped_for_rc(context, j)) {
+ is_rc = 1;
+ }
+ if (is_rc) {
pcie_rc[i][j] = 1;
num_rc_controllers++;
}
--
1.8.3.1

2013-08-05 20:44:08

by Chris Metcalf

Subject: [PATCH 19/20] tile PCI RC: reduce driver's vmalloc space usage

We can take advantage of the fact that bit 29 is hard-wired
to zero in register TRIO_TILE_PIO_REGION_SETUP_CFG_ADDR.
This is handy since at the moment we only allocate one 4GB
region for vmalloc, and with this change we can map the config
space of four or more TRIO MACs without using up all the
vmalloc space.
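
To spell out the arithmetic from the comment added below: with
TRIO_TILE_PIO_REGION_SETUP_CFG_ADDR__MAC_SHIFT at 30, a full per-MAC
config mapping covers 1 << 30 bytes = 1GB of VA. Bit 29 always reads
as zero, so mapping only 1 << 29 bytes = 512MB per MAC loses nothing,
and four MACs then occupy 2GB of the single 4GB vmalloc region instead
of all of it.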

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/pci_gx.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index f2bf200..4843881 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -860,9 +860,15 @@ int __init pcibios_init(void)

#endif

+ /*
+ * To save VMALLOC space, we take advantage of the fact that
+ * bit 29 in the PIO CFG address format is reserved 0. With
+ * TRIO_TILE_PIO_REGION_SETUP_CFG_ADDR__MAC_SHIFT being 30,
+ * this cuts VMALLOC space usage from 1GB to 512MB per mac.
+ */
trio_context->mmio_base_pio_cfg[mac] =
- iorpc_ioremap(trio_context->fd, offset,
- (1 << TRIO_TILE_PIO_REGION_SETUP_CFG_ADDR__MAC_SHIFT));
+ iorpc_ioremap(trio_context->fd, offset, (1UL <<
+ (TRIO_TILE_PIO_REGION_SETUP_CFG_ADDR__MAC_SHIFT - 1)));
if (trio_context->mmio_base_pio_cfg[mac] == NULL) {
pr_err("PCI: PIO map failure for mac %d on TRIO %d\n",
mac, trio_index);
--
1.8.3.1

2013-08-05 20:44:32

by Chris Metcalf

Subject: [PATCH 17/20] tile PCI DMA: fix bug in non-page-aligned accessors

The code incorrectly masked with PAGE_OFFSET instead of
PAGE_SIZE-1. This only matters when doing a non-page-aligned
DMA; it was noticed during code inspection.
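
A sketch of why the old mask was wrong (page size illustrative):

    /*
     * PAGE_SIZE - 1 (e.g. 0xffff with 64KB pages) is the mask that
     * keeps the in-page byte offset; PAGE_OFFSET is the base VA of
     * the kernel's direct map -- a large address, not a mask.  So
     * "dma_address & PAGE_OFFSET" produced a bogus offset whenever
     * the DMA address was not page-aligned:
     */
    offset = dma_address & (PAGE_SIZE - 1); /* correct in-page offset */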

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/pci-dma.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
index 3570ff7..cbc6fc4 100644
--- a/arch/tile/kernel/pci-dma.c
+++ b/arch/tile/kernel/pci-dma.c
@@ -257,7 +257,7 @@ static void tile_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
BUG_ON(!valid_dma_direction(direction));

__dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
- dma_address & PAGE_OFFSET, size, direction);
+ dma_address & (PAGE_SIZE - 1), size, direction);
}

static void tile_dma_sync_single_for_cpu(struct device *dev,
@@ -436,7 +436,7 @@ static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
dma_address -= get_dma_offset(dev);

__dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
- dma_address & PAGE_OFFSET, size, direction);
+ dma_address & (PAGE_SIZE - 1), size, direction);
}

static void tile_pci_dma_sync_single_for_cpu(struct device *dev,
--
1.8.3.1

2013-08-05 20:44:55

by Chris Metcalf

Subject: [PATCH 15/20] tile PCI RC: use proper accessor function

Using the low-level hv_dev_pread() API makes assumptions about the
layout of data structures in the Tilera hypervisor API; it's better to
use the gxio_XXX accessor and the pcie_trio_ports_property struct.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/pci_gx.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index 8352d85..de5008b 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -91,7 +91,7 @@ static int rc_delay[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];
TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_ENDPOINT_G1

/* Array of the PCIe ports configuration info obtained from the BIB. */
-struct pcie_port_property pcie_ports[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];
+struct pcie_trio_ports_property pcie_ports[TILEGX_NUM_TRIO];

/* Number of configured TRIO instances. */
int num_trio_shims;
@@ -195,10 +195,7 @@ static int tile_pcie_open(int trio_index)
#endif

/* Get the properties of the PCIe ports on this TRIO instance. */
- ret = hv_dev_pread(context->fd, 0,
- (HV_VirtAddr)&pcie_ports[trio_index][0],
- sizeof(struct pcie_port_property) * TILEGX_TRIO_PCIES,
- GXIO_TRIO_OP_GET_PORT_PROPERTY);
+ ret = gxio_trio_get_port_property(context, &pcie_ports[trio_index]);
if (ret < 0) {
pr_err("PCI: PCIE_GET_PORT_PROPERTY failure, error %d,"
" on TRIO %d\n", ret, trio_index);
@@ -221,8 +218,8 @@ static int tile_pcie_open(int trio_index)
unsigned int reg_offset;

/* Ignore ports that are not specified in the BIB. */
- if (!pcie_ports[trio_index][mac].allow_rc &&
- !pcie_ports[trio_index][mac].allow_ep)
+ if (!pcie_ports[trio_index].ports[mac].allow_rc &&
+ !pcie_ports[trio_index].ports[mac].allow_ep)
continue;

reg_offset =
@@ -243,7 +240,7 @@ static int tile_pcie_open(int trio_index)
*/
if (port_config.strap_state == AUTO_CONFIG_EP ||
port_config.strap_state == AUTO_CONFIG_EP_G1)
- pcie_ports[trio_index][mac].allow_ep = 1;
+ pcie_ports[trio_index].ports[mac].allow_ep = 1;
}
}

@@ -438,9 +435,10 @@ int __init tile_pci_init(void)
return 0;

/*
- * Now determine which PCIe ports are configured to operate in RC mode.
- * We look at the Board Information Block first and then see if there
- * are any overriding configuration by the HW strapping pin.
+ * Now determine which PCIe ports are configured to operate in RC
+ * mode. To use a port, it must be allowed to be in RC mode by the
+ * Board Information Block, and the hardware strapping pins must be
+ * set to RC mode.
*/
for (i = 0; i < TILEGX_NUM_TRIO; i++) {
gxio_trio_context_t *context = &trio_contexts[i];
@@ -449,7 +447,7 @@ int __init tile_pci_init(void)
continue;

for (j = 0; j < TILEGX_TRIO_PCIES; j++) {
- if (pcie_ports[i][j].allow_rc &&
+ if (pcie_ports[i].ports[j].allow_rc &&
strapped_for_rc(context, j)) {
pcie_rc[i][j] = 1;
num_rc_controllers++;
@@ -736,7 +734,7 @@ int __init pcibios_init(void)
__gxio_mmio_read(trio_context->mmio_base_mac +
reg_offset);
if (!port_status.dl_up) {
- if (pcie_ports[trio_index][mac].removable) {
+ if (pcie_ports[trio_index].ports[mac].removable) {
pr_info("PCI: link is down, MAC %d on TRIO %d\n",
mac, trio_index);
pr_info("This is expected if no PCIe card"
--
1.8.3.1

2013-08-05 20:46:24

by Chris Metcalf

Subject: [PATCH 14/20] tile PCI RC: clean up comments and whitespace format

This change is purely stylistic but improves the readability
of the tile PCI RC driver.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/pci_gx.c | 180 +++++++++++++++-------------------------------
1 file changed, 56 insertions(+), 124 deletions(-)

diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index 6837be2..8352d85 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -108,17 +108,15 @@ static struct pci_ops tile_cfg_ops;
/* Mask of CPUs that should receive PCIe interrupts. */
static struct cpumask intr_cpus_map;

-/*
- * We don't need to worry about the alignment of resources.
- */
+/* We don't need to worry about the alignment of resources. */
resource_size_t pcibios_align_resource(void *data, const struct resource *res,
- resource_size_t size, resource_size_t align)
+ resource_size_t size,
+ resource_size_t align)
{
return res->start;
}
EXPORT_SYMBOL(pcibios_align_resource);

-
/*
* Pick a CPU to receive and handle the PCIe interrupts, based on the IRQ #.
* For now, we simply send interrupts to non-dataplane CPUs.
@@ -146,25 +144,19 @@ static int tile_irq_cpu(int irq)
return cpu;
}

-/*
- * Open a file descriptor to the TRIO shim.
- */
+/* Open a file descriptor to the TRIO shim. */
static int tile_pcie_open(int trio_index)
{
gxio_trio_context_t *context = &trio_contexts[trio_index];
int ret;
int mac;

- /*
- * This opens a file descriptor to the TRIO shim.
- */
+ /* This opens a file descriptor to the TRIO shim. */
ret = gxio_trio_init(context, trio_index);
if (ret < 0)
goto gxio_trio_init_failure;

- /*
- * Allocate an ASID for the kernel.
- */
+ /* Allocate an ASID for the kernel. */
ret = gxio_trio_alloc_asids(context, 1, 0, 0);
if (ret < 0) {
pr_err("PCI: ASID alloc failure on TRIO %d, give up\n",
@@ -285,20 +277,17 @@ static int __init tile_trio_init(void)
}
postcore_initcall(tile_trio_init);

-static void
-tilegx_legacy_irq_ack(struct irq_data *d)
+static void tilegx_legacy_irq_ack(struct irq_data *d)
{
__insn_mtspr(SPR_IPI_EVENT_RESET_K, 1UL << d->irq);
}

-static void
-tilegx_legacy_irq_mask(struct irq_data *d)
+static void tilegx_legacy_irq_mask(struct irq_data *d)
{
__insn_mtspr(SPR_IPI_MASK_SET_K, 1UL << d->irq);
}

-static void
-tilegx_legacy_irq_unmask(struct irq_data *d)
+static void tilegx_legacy_irq_unmask(struct irq_data *d)
{
__insn_mtspr(SPR_IPI_MASK_RESET_K, 1UL << d->irq);
}
@@ -319,8 +308,7 @@ static struct irq_chip tilegx_legacy_irq_chip = {
* to Linux which just calls handle_level_irq() after clearing the
* MAC INTx Assert status bit associated with this interrupt.
*/
-static void
-trio_handle_level_irq(unsigned int irq, struct irq_desc *desc)
+static void trio_handle_level_irq(unsigned int irq, struct irq_desc *desc)
{
struct pci_controller *controller = irq_desc_get_handler_data(desc);
gxio_trio_context_t *trio_context = controller->trio;
@@ -386,9 +374,7 @@ static int tile_init_irqs(struct pci_controller *controller)
goto free_irqs;
}

- /*
- * Register the IRQ handler with the kernel.
- */
+ /* Register the IRQ handler with the kernel. */
irq_set_chip_and_handler(irq, &tilegx_legacy_irq_chip,
trio_handle_level_irq);
irq_set_chip_data(irq, (void *)(uint64_t)i);
@@ -471,15 +457,11 @@ int __init tile_pci_init(void)
}
}

- /*
- * Return if no PCIe ports are configured to operate in RC mode.
- */
+ /* Return if no PCIe ports are configured to operate in RC mode. */
if (num_rc_controllers == 0)
return 0;

- /*
- * Set the TRIO pointer and MAC index for each PCIe RC port.
- */
+ /* Set the TRIO pointer and MAC index for each PCIe RC port. */
for (i = 0; i < TILEGX_NUM_TRIO; i++) {
for (j = 0; j < TILEGX_TRIO_PCIES; j++) {
if (pcie_rc[i][j]) {
@@ -495,14 +477,10 @@ int __init tile_pci_init(void)
}

out:
- /*
- * Configure each PCIe RC port.
- */
+ /* Configure each PCIe RC port. */
for (i = 0; i < num_rc_controllers; i++) {
- /*
- * Configure the PCIe MAC to run in RC mode.
- */

+ /* Configure the PCIe MAC to run in RC mode. */
struct pci_controller *controller = &pci_controllers[i];

controller->index = i;
@@ -525,7 +503,6 @@ out:
* is in range [3GB, 4GB - 1] of a 4GB space beyond the
* PA space.
*/
-
controller->mem_offset = TILE_PCI_MEM_START +
(i * TILE_PCI_BAR_WINDOW_TOP);
controller->mem_space.start = controller->mem_offset +
@@ -553,7 +530,6 @@ static int tile_map_irq(const struct pci_dev *dev, u8 device, u8 pin)
return controller->irq_intx_table[pin - 1];
}

-
static void fixup_read_and_payload_sizes(struct pci_controller *controller)
{
gxio_trio_context_t *trio_context = controller->trio;
@@ -567,9 +543,7 @@ static void fixup_read_and_payload_sizes(struct pci_controller *controller)

mac = controller->mac;

- /*
- * Set our max read request size to be 4KB.
- */
+ /* Set our max read request size to be 4KB. */
reg_offset =
(TRIO_PCIE_RC_DEVICE_CONTROL <<
TRIO_CFG_REGION_ADDR__REG_SHIFT) |
@@ -578,10 +552,10 @@ static void fixup_read_and_payload_sizes(struct pci_controller *controller)
(mac << TRIO_CFG_REGION_ADDR__MAC_SEL_SHIFT);

dev_control.word = __gxio_mmio_read32(trio_context->mmio_base_mac +
- reg_offset);
+ reg_offset);
dev_control.max_read_req_sz = 5;
__gxio_mmio_write32(trio_context->mmio_base_mac + reg_offset,
- dev_control.word);
+ dev_control.word);

/*
* Set the max payload size supported by this Gx PCIe MAC.
@@ -597,10 +571,10 @@ static void fixup_read_and_payload_sizes(struct pci_controller *controller)
(mac << TRIO_CFG_REGION_ADDR__MAC_SEL_SHIFT);

rc_dev_cap.word = __gxio_mmio_read32(trio_context->mmio_base_mac +
- reg_offset);
+ reg_offset);
rc_dev_cap.mps_sup = 1;
__gxio_mmio_write32(trio_context->mmio_base_mac + reg_offset,
- rc_dev_cap.word);
+ rc_dev_cap.word);

/* Configure PCI Express MPS setting. */
list_for_each_entry(child, &root_bus->children, node) {
@@ -628,7 +602,7 @@ static void fixup_read_and_payload_sizes(struct pci_controller *controller)
dev_control.max_payload_size,
dev_control.max_read_req_sz,
mac);
- if (err < 0) {
+ if (err < 0) {
pr_err("PCI: PCIE_CONFIGURE_MAC_MPS_MRS failure, "
"MAC %d on TRIO %d\n",
mac, controller->trio_index);
@@ -672,9 +646,7 @@ static int setup_pcie_rc_delay(char *str)
}
early_param("pcie_rc_delay", setup_pcie_rc_delay);

-/*
- * PCI initialization entry point, called by subsys_initcall.
- */
+/* PCI initialization entry point, called by subsys_initcall. */
int __init pcibios_init(void)
{
resource_size_t offset;
@@ -744,9 +716,7 @@ int __init pcibios_init(void)
pr_info("PCI: Found PCI controller #%d on TRIO %d MAC %d\n", i,
trio_index, controller->mac);

- /*
- * Delay the bus probe if needed.
- */
+ /* Delay the bus probe if needed. */
if (rc_delay[trio_index][mac]) {
pr_info("Delaying PCIe RC bus enumerating %d sec"
" on MAC %d on TRIO %d\n",
@@ -761,9 +731,7 @@ int __init pcibios_init(void)
msleep(1000);
}

- /*
- * Check for PCIe link-up status again.
- */
+ /* Check for PCIe link-up status again. */
port_status.word =
__gxio_mmio_read(trio_context->mmio_base_mac +
reg_offset);
@@ -801,7 +769,6 @@ int __init pcibios_init(void)
* Change the device ID so that Linux bus crawl doesn't confuse
* the internal bridge with any Tilera endpoints.
*/
-
reg_offset =
(TRIO_PCIE_RC_DEVICE_ID_VEN_ID <<
TRIO_CFG_REGION_ADDR__REG_SHIFT) |
@@ -814,10 +781,7 @@ int __init pcibios_init(void)
TRIO_PCIE_RC_DEVICE_ID_VEN_ID__DEV_ID_SHIFT) |
TILERA_VENDOR_ID);

- /*
- * Set the internal P2P bridge class code.
- */
-
+ /* Set the internal P2P bridge class code. */
reg_offset =
(TRIO_PCIE_RC_REVISION_ID <<
TRIO_CFG_REGION_ADDR__REG_SHIFT) |
@@ -828,26 +792,22 @@ int __init pcibios_init(void)
class_code_revision =
__gxio_mmio_read32(trio_context->mmio_base_mac +
reg_offset);
- class_code_revision = (class_code_revision & 0xff ) |
- (PCI_CLASS_BRIDGE_PCI << 16);
+ class_code_revision = (class_code_revision & 0xff) |
+ (PCI_CLASS_BRIDGE_PCI << 16);

__gxio_mmio_write32(trio_context->mmio_base_mac +
reg_offset, class_code_revision);

#ifdef USE_SHARED_PCIE_CONFIG_REGION

- /*
- * Map in the MMIO space for the PIO region.
- */
+ /* Map in the MMIO space for the PIO region. */
offset = HV_TRIO_PIO_OFFSET(trio_context->pio_cfg_index) |
(((unsigned long long)mac) <<
TRIO_TILE_PIO_REGION_SETUP_CFG_ADDR__MAC_SHIFT);

#else

- /*
- * Alloc a PIO region for PCI config access per MAC.
- */
+ /* Alloc a PIO region for PCI config access per MAC. */
ret = gxio_trio_alloc_pio_regions(trio_context, 1, 0, 0);
if (ret < 0) {
pr_err("PCI: PCI CFG PIO alloc failure for mac %d "
@@ -858,9 +818,7 @@ int __init pcibios_init(void)

trio_context->pio_cfg_index[mac] = ret;

- /*
- * For PIO CFG, the bus_address_hi parameter is 0.
- */
+ /* For PIO CFG, the bus_address_hi parameter is 0. */
ret = gxio_trio_init_pio_region_aux(trio_context,
trio_context->pio_cfg_index[mac],
mac, 0, HV_TRIO_PIO_FLAG_CONFIG_SPACE);
@@ -887,9 +845,7 @@ int __init pcibios_init(void)
continue;
}

- /*
- * Initialize the PCIe interrupts.
- */
+ /* Initialize the PCIe interrupts. */
if (tile_init_irqs(controller)) {
pr_err("PCI: IRQs init failure for mac %d on TRIO %d\n",
mac, trio_index);
@@ -921,7 +877,6 @@ int __init pcibios_init(void)
* It allocates all of the resources (I/O memory, etc)
* associated with the devices read in above.
*/
-
pci_assign_unassigned_resources();

/* Record the I/O resources in the PCI controller structure. */
@@ -942,14 +897,12 @@ int __init pcibios_init(void)
/* Configure the max_payload_size values for this domain. */
fixup_read_and_payload_sizes(controller);

- /*
- * Alloc a PIO region for PCI memory access for each RC port.
- */
+ /* Alloc a PIO region for PCI memory access for each RC port. */
ret = gxio_trio_alloc_pio_regions(trio_context, 1, 0, 0);
if (ret < 0) {
pr_err("PCI: MEM PIO alloc failure on TRIO %d mac %d, "
- "give up\n", controller->trio_index,
- controller->mac);
+ "give up\n", controller->trio_index,
+ controller->mac);

continue;
}
@@ -967,8 +920,8 @@ int __init pcibios_init(void)
0);
if (ret < 0) {
pr_err("PCI: MEM PIO init failure on TRIO %d mac %d, "
- "give up\n", controller->trio_index,
- controller->mac);
+ "give up\n", controller->trio_index,
+ controller->mac);

continue;
}
@@ -980,8 +933,8 @@ int __init pcibios_init(void)
ret = gxio_trio_alloc_pio_regions(trio_context, 1, 0, 0);
if (ret < 0) {
pr_err("PCI: I/O PIO alloc failure on TRIO %d mac %d, "
- "give up\n", controller->trio_index,
- controller->mac);
+ "give up\n", controller->trio_index,
+ controller->mac);

continue;
}
@@ -999,8 +952,8 @@ int __init pcibios_init(void)
HV_TRIO_PIO_FLAG_IO_SPACE);
if (ret < 0) {
pr_err("PCI: I/O PIO init failure on TRIO %d mac %d, "
- "give up\n", controller->trio_index,
- controller->mac);
+ "give up\n", controller->trio_index,
+ controller->mac);

continue;
}
@@ -1020,9 +973,9 @@ int __init pcibios_init(void)
0);
if (ret < 0) {
pr_err("PCI: Mem-Map alloc failure on TRIO %d "
- "mac %d for MC %d, give up\n",
- controller->trio_index,
- controller->mac, j);
+ "mac %d for MC %d, give up\n",
+ controller->trio_index,
+ controller->mac, j);

goto alloc_mem_map_failed;
}
@@ -1053,9 +1006,9 @@ int __init pcibios_init(void)
GXIO_TRIO_ORDER_MODE_UNORDERED);
if (ret < 0) {
pr_err("PCI: Mem-Map init failure on TRIO %d "
- "mac %d for MC %d, give up\n",
- controller->trio_index,
- controller->mac, j);
+ "mac %d for MC %d, give up\n",
+ controller->trio_index,
+ controller->mac, j);

goto alloc_mem_map_failed;
}
@@ -1064,22 +1017,18 @@ int __init pcibios_init(void)
alloc_mem_map_failed:
break;
}
-
}

return 0;
}
subsys_initcall(pcibios_init);

-/* Note: to be deleted after Linux 3.6 merge. */
+/* No bus fixups needed. */
void pcibios_fixup_bus(struct pci_bus *bus)
{
}

-/*
- * This can be called from the generic PCI layer, but doesn't need to
- * do anything.
- */
+/* Process any "pci=" kernel boot arguments. */
char *pcibios_setup(char *str)
{
if (!strcmp(str, "off")) {
@@ -1130,7 +1079,6 @@ void __iomem *ioremap(resource_size_t phys_addr, unsigned long size)
* By searching phys_addr in each controller's mem_space, we can
* determine the controller that should accept the PCI memory access.
*/
-
for (i = 0; i < num_rc_controllers; i++) {
/*
* Skip controllers that are not properly initialized or
@@ -1158,9 +1106,7 @@ void __iomem *ioremap(resource_size_t phys_addr, unsigned long size)

offset = HV_TRIO_PIO_OFFSET(controller->pio_mem_index) + start;

- /*
- * We need to keep the PCI bus address's in-page offset in the VA.
- */
+ /* We need to keep the PCI bus address's in-page offset in the VA. */
return iorpc_ioremap(trio_fd, offset, size) +
(start & (PAGE_SIZE - 1));
}
@@ -1186,7 +1132,6 @@ void __iomem *ioport_map(unsigned long port, unsigned int size)
* By searching the port in each controller's io_space, we can
* determine the controller that should accept the PCI I/O access.
*/
-
for (i = 0; i < num_rc_controllers; i++) {
/*
* Skip controllers that are not properly initialized or
@@ -1214,9 +1159,7 @@ void __iomem *ioport_map(unsigned long port, unsigned int size)

offset = HV_TRIO_PIO_OFFSET(controller->pio_io_index) + port;

- /*
- * We need to keep the PCI bus address's in-page offset in the VA.
- */
+ /* We need to keep the PCI bus address's in-page offset in the VA. */
return iorpc_ioremap(trio_fd, offset, size) + (port & (PAGE_SIZE - 1));
}
EXPORT_SYMBOL(ioport_map);
@@ -1249,7 +1192,6 @@ EXPORT_SYMBOL(pci_iounmap);
* offset is in bytes, from the start of config space for the
* specified bus & device.
*/
-
static int tile_cfg_read(struct pci_bus *bus, unsigned int devfn, int offset,
int size, u32 *val)
{
@@ -1299,7 +1241,6 @@ static int tile_cfg_read(struct pci_bus *bus, unsigned int devfn, int offset,
* Accesses to the directly attached device have to be
* sent as type-0 configs.
*/
-
if (busnum == (controller->first_busno + 1)) {
/*
* There is only one device off of our built-in P2P bridge.
@@ -1321,9 +1262,8 @@ static int tile_cfg_read(struct pci_bus *bus, unsigned int devfn, int offset,
* Note that we don't set the mac field in cfg_addr because the
* mapping is per port.
*/
-
mmio_addr = trio_context->mmio_base_pio_cfg[controller->mac] +
- cfg_addr.word;
+ cfg_addr.word;

valid_device:

@@ -1427,7 +1367,6 @@ static int tile_cfg_write(struct pci_bus *bus, unsigned int devfn, int offset,
* Accesses to the directly attached device have to be
* sent as type-0 configs.
*/
-
if (busnum == (controller->first_busno + 1)) {
/*
* There is only one device off of our built-in P2P bridge.
@@ -1449,7 +1388,6 @@ static int tile_cfg_write(struct pci_bus *bus, unsigned int devfn, int offset,
* Note that we don't set the mac field in cfg_addr because the
* mapping is per port.
*/
-
mmio_addr = trio_context->mmio_base_pio_cfg[controller->mac] +
cfg_addr.word;

@@ -1487,11 +1425,8 @@ static struct pci_ops tile_cfg_ops = {
};


-/*
- * MSI support starts here.
- */
-static unsigned int
-tilegx_msi_startup(struct irq_data *d)
+/* MSI support starts here. */
+static unsigned int tilegx_msi_startup(struct irq_data *d)
{
if (d->msi_desc)
unmask_msi_irq(d);
@@ -1499,21 +1434,18 @@ tilegx_msi_startup(struct irq_data *d)
return 0;
}

-static void
-tilegx_msi_ack(struct irq_data *d)
+static void tilegx_msi_ack(struct irq_data *d)
{
__insn_mtspr(SPR_IPI_EVENT_RESET_K, 1UL << d->irq);
}

-static void
-tilegx_msi_mask(struct irq_data *d)
+static void tilegx_msi_mask(struct irq_data *d)
{
mask_msi_irq(d);
__insn_mtspr(SPR_IPI_MASK_SET_K, 1UL << d->irq);
}

-static void
-tilegx_msi_unmask(struct irq_data *d)
+static void tilegx_msi_unmask(struct irq_data *d)
{
__insn_mtspr(SPR_IPI_MASK_RESET_K, 1UL << d->irq);
unmask_msi_irq(d);
--
1.8.3.1

2013-08-05 20:43:07

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH 10/20] tile PCI DMA: handle a NULL dev argument properly

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/pci-dma.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
index 7e98371..d16cfc1 100644
--- a/arch/tile/kernel/pci-dma.c
+++ b/arch/tile/kernel/pci-dma.c
@@ -36,8 +36,9 @@ static void *tile_dma_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
struct dma_attrs *attrs)
{
- u64 dma_mask = dev->coherent_dma_mask ?: DMA_BIT_MASK(32);
- int node = dev_to_node(dev);
+ u64 dma_mask = (dev && dev->coherent_dma_mask) ?
+ dev->coherent_dma_mask : DMA_BIT_MASK(32);
+ int node = dev ? dev_to_node(dev) : 0;
int order = get_order(size);
struct page *pg;
dma_addr_t addr;
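
For context, a minimal sketch of the kind of caller this change guards
against (hypothetical; any code that invokes the DMA API without a
struct device reaches tile_dma_alloc_coherent() with dev == NULL):

        /* Hypothetical caller: no struct device is available, so the
         * DMA API is called with NULL and the allocator must fall
         * back to a 32-bit mask and node 0. */
        dma_addr_t handle;
        void *buf = dma_alloc_coherent(NULL, PAGE_SIZE, &handle, GFP_KERNEL);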
--
1.8.3.1

2013-08-05 20:47:23

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH 13/20] tile PCI RC: include pci/pcie/Kconfig

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/Kconfig | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig
index bfff769..e41a381 100644
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@ -408,6 +408,8 @@ config TILE_PCI_IO

source "drivers/pci/Kconfig"

+source "drivers/pci/pcie/Kconfig"
+
config TILE_USB
tristate "Tilera USB host adapter support"
default y
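
(For reference, sourcing drivers/pci/pcie/Kconfig is what makes the
generic PCIe port-service options, such as PCIEPORTBUS and the AER
service it enables, selectable in the tile configuration.)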
--
1.8.3.1

2013-08-05 20:43:04

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH 11/20] tile PCI RC: restructure TRIO initialization

The TRIO shim initialization is shared with other kernel drivers
such as the endpoint and StreamIO drivers, so reorganize the
initialization flow to ensure that the root complex driver properly
initializes TRIO state regardless of what kind of TRIO driver will
end up using the shim.
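
Concretely, in the diff below the per-shim open moves out of
tile_pci_init() into a new tile_trio_init() run at postcore_initcall
time, and the shim count is exported as num_trio_shims so that any
TRIO client can consult it.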

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/pci.h | 1 +
arch/tile/kernel/pci_gx.c | 209 +++++++++++++++++++++++++-------------------
2 files changed, 118 insertions(+), 92 deletions(-)

diff --git a/arch/tile/include/asm/pci.h b/arch/tile/include/asm/pci.h
index 9cf5308..1f1b654 100644
--- a/arch/tile/include/asm/pci.h
+++ b/arch/tile/include/asm/pci.h
@@ -180,6 +180,7 @@ struct pci_controller {

extern struct pci_controller pci_controllers[TILEGX_NUM_TRIO * TILEGX_TRIO_PCIES];
extern gxio_trio_context_t trio_contexts[TILEGX_NUM_TRIO];
+extern int num_trio_shims;

extern void pci_iounmap(struct pci_dev *dev, void __iomem *);

diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index fde3589..8051638 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -80,16 +80,28 @@ static int rc_delay[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];
/* The PCI I/O space size in each PCI domain. */
#define IO_SPACE_SIZE 0x10000

+/* Provide shorter versions of some very long constant names. */
+#define AUTO_CONFIG_RC \
+ TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_RC
+#define AUTO_CONFIG_RC_G1 \
+ TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_RC_G1
+#define AUTO_CONFIG_EP \
+ TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_ENDPOINT
+#define AUTO_CONFIG_EP_G1 \
+ TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_ENDPOINT_G1
+
/* Array of the PCIe ports configuration info obtained from the BIB. */
struct pcie_port_property pcie_ports[TILEGX_NUM_TRIO][TILEGX_TRIO_PCIES];

+/* Number of configured TRIO instances. */
+int num_trio_shims;
+
/* All drivers share the TRIO contexts defined here. */
gxio_trio_context_t trio_contexts[TILEGX_NUM_TRIO];

/* Pointer to an array of PCIe RC controllers. */
struct pci_controller pci_controllers[TILEGX_NUM_TRIO * TILEGX_TRIO_PCIES];
int num_rc_controllers;
-static int num_ep_controllers;

static struct pci_ops tile_cfg_ops;

@@ -141,13 +153,14 @@ static int tile_pcie_open(int trio_index)
{
gxio_trio_context_t *context = &trio_contexts[trio_index];
int ret;
+ int mac;

/*
* This opens a file descriptor to the TRIO shim.
*/
ret = gxio_trio_init(context, trio_index);
if (ret < 0)
- return ret;
+ goto gxio_trio_init_failure;

/*
* Allocate an ASID for the kernel.
@@ -189,17 +202,89 @@ static int tile_pcie_open(int trio_index)
}
#endif

+ /* Get the properties of the PCIe ports on this TRIO instance. */
+ ret = hv_dev_pread(context->fd, 0,
+ (HV_VirtAddr)&pcie_ports[trio_index][0],
+ sizeof(struct pcie_port_property) * TILEGX_TRIO_PCIES,
+ GXIO_TRIO_OP_GET_PORT_PROPERTY);
+ if (ret < 0) {
+ pr_err("PCI: PCIE_GET_PORT_PROPERTY failure, error %d,"
+ " on TRIO %d\n", ret, trio_index);
+ goto get_port_property_failure;
+ }
+
+ context->mmio_base_mac =
+ iorpc_ioremap(context->fd, 0, HV_TRIO_CONFIG_IOREMAP_SIZE);
+ if (context->mmio_base_mac == NULL) {
+ pr_err("PCI: TRIO config space mapping failure, error %d,"
+ " on TRIO %d\n", ret, trio_index);
+ ret = -ENOMEM;
+
+ goto trio_mmio_mapping_failure;
+ }
+
+ /* Check the port strap state which will override the BIB setting. */
+ for (mac = 0; mac < TILEGX_TRIO_PCIES; mac++) {
+ TRIO_PCIE_INTFC_PORT_CONFIG_t port_config;
+ unsigned int reg_offset;
+
+ /* Ignore ports that are not specified in the BIB. */
+ if (!pcie_ports[trio_index][mac].allow_rc &&
+ !pcie_ports[trio_index][mac].allow_ep)
+ continue;
+
+ reg_offset =
+ (TRIO_PCIE_INTFC_PORT_CONFIG <<
+ TRIO_CFG_REGION_ADDR__REG_SHIFT) |
+ (TRIO_CFG_REGION_ADDR__INTFC_VAL_MAC_INTERFACE <<
+ TRIO_CFG_REGION_ADDR__INTFC_SHIFT) |
+ (mac << TRIO_CFG_REGION_ADDR__MAC_SEL_SHIFT);
+
+ port_config.word =
+ __gxio_mmio_read(context->mmio_base_mac + reg_offset);
+
+ if (port_config.strap_state != AUTO_CONFIG_RC &&
+ port_config.strap_state != AUTO_CONFIG_RC_G1) {
+ /*
+ * If this is really intended to be an EP port, record
+ * it so that the endpoint driver will know about it.
+ */
+ if (port_config.strap_state == AUTO_CONFIG_EP ||
+ port_config.strap_state == AUTO_CONFIG_EP_G1)
+ pcie_ports[trio_index][mac].allow_ep = 1;
+ }
+ }
+
return ret;

+trio_mmio_mapping_failure:
+get_port_property_failure:
asid_alloc_failure:
#ifdef USE_SHARED_PCIE_CONFIG_REGION
pio_alloc_failure:
#endif
hv_dev_close(context->fd);
+gxio_trio_init_failure:
+ context->fd = -1;

return ret;
}

+static int __init tile_trio_init(void)
+{
+ int i;
+
+ /* We loop over all the TRIO shims. */
+ for (i = 0; i < TILEGX_NUM_TRIO; i++) {
+ if (tile_pcie_open(i) < 0)
+ continue;
+ num_trio_shims++;
+ }
+
+ return 0;
+}
+postcore_initcall(tile_trio_init);
+
static void
tilegx_legacy_irq_ack(struct irq_data *d)
{
@@ -320,14 +405,39 @@ free_irqs:
}

/*
+ * Return 1 if the port is strapped to operate in RC mode.
+ */
+static int
+strapped_for_rc(gxio_trio_context_t *trio_context, int mac)
+{
+ TRIO_PCIE_INTFC_PORT_CONFIG_t port_config;
+ unsigned int reg_offset;
+
+ /* Check the port configuration. */
+ reg_offset =
+ (TRIO_PCIE_INTFC_PORT_CONFIG <<
+ TRIO_CFG_REGION_ADDR__REG_SHIFT) |
+ (TRIO_CFG_REGION_ADDR__INTFC_VAL_MAC_INTERFACE <<
+ TRIO_CFG_REGION_ADDR__INTFC_SHIFT) |
+ (mac << TRIO_CFG_REGION_ADDR__MAC_SEL_SHIFT);
+ port_config.word =
+ __gxio_mmio_read(trio_context->mmio_base_mac + reg_offset);
+
+ if (port_config.strap_state == AUTO_CONFIG_RC ||
+ port_config.strap_state == AUTO_CONFIG_RC_G1)
+ return 1;
+ else
+ return 0;
+}
+
+/*
* Find valid controllers and fill in pci_controller structs for each
* of them.
*
- * Returns the number of controllers discovered.
+ * Return the number of controllers discovered.
*/
int __init tile_pci_init(void)
{
- int num_trio_shims = 0;
int ctl_index = 0;
int i, j;

@@ -338,19 +448,6 @@ int __init tile_pci_init(void)

pr_info("PCI: Searching for controllers...\n");

- /*
- * We loop over all the TRIO shims.
- */
- for (i = 0; i < TILEGX_NUM_TRIO; i++) {
- int ret;
-
- ret = tile_pcie_open(i);
- if (ret < 0)
- continue;
-
- num_trio_shims++;
- }
-
if (num_trio_shims == 0 || sim_is_simulator())
return 0;

@@ -361,29 +458,16 @@ int __init tile_pci_init(void)
*/
for (i = 0; i < TILEGX_NUM_TRIO; i++) {
gxio_trio_context_t *context = &trio_contexts[i];
- int ret;

if (context->fd < 0)
continue;

- ret = hv_dev_pread(context->fd, 0,
- (HV_VirtAddr)&pcie_ports[i][0],
- sizeof(struct pcie_port_property) * TILEGX_TRIO_PCIES,
- GXIO_TRIO_OP_GET_PORT_PROPERTY);
- if (ret < 0) {
- pr_err("PCI: PCIE_GET_PORT_PROPERTY failure, error %d,"
- " on TRIO %d\n", ret, i);
- continue;
- }
-
for (j = 0; j < TILEGX_TRIO_PCIES; j++) {
- if (pcie_ports[i][j].allow_rc) {
+ if (pcie_ports[i][j].allow_rc &&
+ strapped_for_rc(context, j)) {
pcie_rc[i][j] = 1;
num_rc_controllers++;
}
- else if (pcie_ports[i][j].allow_ep) {
- num_ep_controllers++;
- }
}
}

@@ -600,35 +684,10 @@ int __init pcibios_init(void)

tile_pci_init();

- if (num_rc_controllers == 0 && num_ep_controllers == 0)
+ if (num_rc_controllers == 0)
return 0;

/*
- * We loop over all the TRIO shims and set up the MMIO mappings.
- */
- for (i = 0; i < TILEGX_NUM_TRIO; i++) {
- gxio_trio_context_t *context = &trio_contexts[i];
-
- if (context->fd < 0)
- continue;
-
- /*
- * Map in the MMIO space for the MAC.
- */
- offset = 0;
- context->mmio_base_mac =
- iorpc_ioremap(context->fd, offset,
- HV_TRIO_CONFIG_IOREMAP_SIZE);
- if (context->mmio_base_mac == NULL) {
- pr_err("PCI: MAC map failure on TRIO %d\n", i);
-
- hv_dev_close(context->fd);
- context->fd = -1;
- continue;
- }
- }
-
- /*
* Delay a bit in case devices aren't ready. Some devices are
* known to require at least 20ms here, but we use a more
* conservative value.
@@ -639,7 +698,6 @@ int __init pcibios_init(void)
for (next_busno = 0, i = 0; i < num_rc_controllers; i++) {
struct pci_controller *controller = &pci_controllers[i];
gxio_trio_context_t *trio_context = controller->trio;
- TRIO_PCIE_INTFC_PORT_CONFIG_t port_config;
TRIO_PCIE_INTFC_PORT_STATUS_t port_status;
TRIO_PCIE_INTFC_TX_FIFO_CTL_t tx_fifo_ctl;
struct pci_bus *bus;
@@ -656,39 +714,6 @@ int __init pcibios_init(void)
mac = controller->mac;

/*
- * Check the port strap state which will override the BIB
- * setting.
- */
-
- reg_offset =
- (TRIO_PCIE_INTFC_PORT_CONFIG <<
- TRIO_CFG_REGION_ADDR__REG_SHIFT) |
- (TRIO_CFG_REGION_ADDR__INTFC_VAL_MAC_INTERFACE <<
- TRIO_CFG_REGION_ADDR__INTFC_SHIFT ) |
- (mac << TRIO_CFG_REGION_ADDR__MAC_SEL_SHIFT);
-
- port_config.word =
- __gxio_mmio_read(trio_context->mmio_base_mac +
- reg_offset);
-
- if ((port_config.strap_state !=
- TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_RC) &&
- (port_config.strap_state !=
- TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_RC_G1)) {
- /*
- * If this is really intended to be an EP port,
- * record it so that the endpoint driver will know about it.
- */
- if (port_config.strap_state ==
- TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_ENDPOINT ||
- port_config.strap_state ==
- TRIO_PCIE_INTFC_PORT_CONFIG__STRAP_STATE_VAL_AUTO_CONFIG_ENDPOINT_G1)
- pcie_ports[trio_index][mac].allow_ep = 1;
-
- continue;
- }
-
- /*
* Check for PCIe link-up status to decide if we need
* to force the link to come up.
*/
--
1.8.3.1

2013-08-05 20:42:52

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH 07/20] tile PCI RC: support more MSI-X interrupt vectors

To support PCIe devices with a higher number of MSI-X interrupt vectors,
e.g. 16 for the LSI RAID card, enhance the Gx RC stack to provide more
MSI-X vectors by using the TRIO Scatter Queues, which provide 8 more
vectors in addition to ~10 from the Map Mem regions.
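
In outline (a condensed sketch of the pci_gx.c change below, with the
error handling omitted), the MSI target allocation becomes:

        /* Prefer a TRIO scatter-queue for the MSI write target; if
         * none are left, fall back to a map-mem interrupt region. */
        mem_map = gxio_trio_alloc_scatter_queues(trio_context, 1, 0, 0);
        if (mem_map < 0)
                mem_map = gxio_trio_alloc_memory_maps(trio_context, 1, 0, 0);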

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/gxio/iorpc_trio.c | 23 +++++++++++++++
arch/tile/include/arch/trio.h | 39 ++++++++++++++++++++++++
arch/tile/include/gxio/iorpc_trio.h | 5 ++++
arch/tile/kernel/pci_gx.c | 59 ++++++++++++++++++++++++-------------
4 files changed, 106 insertions(+), 20 deletions(-)

diff --git a/arch/tile/gxio/iorpc_trio.c b/arch/tile/gxio/iorpc_trio.c
index cef4b22..da6e18e 100644
--- a/arch/tile/gxio/iorpc_trio.c
+++ b/arch/tile/gxio/iorpc_trio.c
@@ -61,6 +61,29 @@ int gxio_trio_alloc_memory_maps(gxio_trio_context_t * context,

EXPORT_SYMBOL(gxio_trio_alloc_memory_maps);

+struct alloc_scatter_queues_param {
+ unsigned int count;
+ unsigned int first;
+ unsigned int flags;
+};
+
+int gxio_trio_alloc_scatter_queues(gxio_trio_context_t * context,
+ unsigned int count, unsigned int first,
+ unsigned int flags)
+{
+ struct alloc_scatter_queues_param temp;
+ struct alloc_scatter_queues_param *params = &temp;
+
+ params->count = count;
+ params->first = first;
+ params->flags = flags;
+
+ return hv_dev_pwrite(context->fd, 0, (HV_VirtAddr) params,
+ sizeof(*params),
+ GXIO_TRIO_OP_ALLOC_SCATTER_QUEUES);
+}
+
+EXPORT_SYMBOL(gxio_trio_alloc_scatter_queues);

struct alloc_pio_regions_param {
unsigned int count;
diff --git a/arch/tile/include/arch/trio.h b/arch/tile/include/arch/trio.h
index d3000a8..c0ddedc 100644
--- a/arch/tile/include/arch/trio.h
+++ b/arch/tile/include/arch/trio.h
@@ -23,6 +23,45 @@
#ifndef __ASSEMBLER__

/*
+ * Map SQ Doorbell Format.
+ * This describes the format of the write-only doorbell register that exists
+ * in the last 8-bytes of the MAP_SQ_BASE/LIM range. This register is only
+ * writable from PCIe space. Writes to this register will not be written to
+ * Tile memory space and thus no IO VA translation is required if the last
+ * page of the BASE/LIM range is not otherwise written.
+ */
+
+__extension__
+typedef union
+{
+ struct
+ {
+#ifndef __BIG_ENDIAN__
+ /*
+ * When written with a 1, the associated MAP_SQ region's doorbell
+ * interrupt will be triggered once all previous writes are visible to
+ * Tile software.
+ */
+ uint_reg_t doorbell : 1;
+ /*
+ * When written with a 1, the descriptor at the head of the associated
+ * MAP_SQ's FIFO will be dequeued.
+ */
+ uint_reg_t pop : 1;
+ /* Reserved. */
+ uint_reg_t __reserved : 62;
+#else /* __BIG_ENDIAN__ */
+ uint_reg_t __reserved : 62;
+ uint_reg_t pop : 1;
+ uint_reg_t doorbell : 1;
+#endif
+ };
+
+ uint_reg_t word;
+} TRIO_MAP_SQ_DOORBELL_FMT_t;
+
+
+/*
* Tile PIO Region Configuration - CFG Address Format.
* This register describes the address format for PIO accesses when the
* associated region is setup with TYPE=CFG.
diff --git a/arch/tile/include/gxio/iorpc_trio.h b/arch/tile/include/gxio/iorpc_trio.h
index 58105c3..d95b96f 100644
--- a/arch/tile/include/gxio/iorpc_trio.h
+++ b/arch/tile/include/gxio/iorpc_trio.h
@@ -30,6 +30,7 @@

#define GXIO_TRIO_OP_ALLOC_MEMORY_MAPS IORPC_OPCODE(IORPC_FORMAT_NONE, 0x1404)

+#define GXIO_TRIO_OP_ALLOC_SCATTER_QUEUES IORPC_OPCODE(IORPC_FORMAT_NONE, 0x140e)
#define GXIO_TRIO_OP_ALLOC_PIO_REGIONS IORPC_OPCODE(IORPC_FORMAT_NONE, 0x1412)

#define GXIO_TRIO_OP_INIT_PIO_REGION_AUX IORPC_OPCODE(IORPC_FORMAT_NONE, 0x1414)
@@ -54,6 +55,10 @@ int gxio_trio_alloc_memory_maps(gxio_trio_context_t * context,
unsigned int flags);


+int gxio_trio_alloc_scatter_queues(gxio_trio_context_t * context,
+ unsigned int count, unsigned int first,
+ unsigned int flags);
+
int gxio_trio_alloc_pio_regions(gxio_trio_context_t * context,
unsigned int count, unsigned int first,
unsigned int flags);
diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index 2cc3e64..e0d6664 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -1474,32 +1474,55 @@ int arch_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
trio_context = controller->trio;

/*
- * Allocate the Mem-Map that will accept the MSI write and
- * trigger the TILE-side interrupts.
+ * Allocate a scatter-queue that will accept the MSI write and
+ * trigger the TILE-side interrupts. We use the scatter-queue regions
+ * before the mem map regions, because the latter are needed by more
+ * applications.
*/
- mem_map = gxio_trio_alloc_memory_maps(trio_context, 1, 0, 0);
- if (mem_map < 0) {
- dev_printk(KERN_INFO, &pdev->dev,
- "%s Mem-Map alloc failure. "
- "Failed to initialize MSI interrupts. "
- "Falling back to legacy interrupts.\n",
- desc->msi_attrib.is_msix ? "MSI-X" : "MSI");
+ mem_map = gxio_trio_alloc_scatter_queues(trio_context, 1, 0, 0);
+ if (mem_map >= 0) {
+ TRIO_MAP_SQ_DOORBELL_FMT_t doorbell_template = {{
+ .pop = 0,
+ .doorbell = 1,
+ }};
+
+ mem_map += TRIO_NUM_MAP_MEM_REGIONS;
+ mem_map_base = MEM_MAP_INTR_REGIONS_BASE +
+ mem_map * MEM_MAP_INTR_REGION_SIZE;
+ mem_map_limit = mem_map_base + MEM_MAP_INTR_REGION_SIZE - 1;
+
+ msi_addr = mem_map_base + MEM_MAP_INTR_REGION_SIZE - 8;
+ msg.data = (unsigned int)doorbell_template.word;
+ } else {
+ /* SQ regions are out, allocate from map mem regions. */
+ mem_map = gxio_trio_alloc_memory_maps(trio_context, 1, 0, 0);
+ if (mem_map < 0) {
+ dev_printk(KERN_INFO, &pdev->dev,
+ "%s Mem-Map alloc failure. "
+ "Failed to initialize MSI interrupts. "
+ "Falling back to legacy interrupts.\n",
+ desc->msi_attrib.is_msix ? "MSI-X" : "MSI");
+ ret = -ENOMEM;
+ goto msi_mem_map_alloc_failure;
+ }

- ret = -ENOMEM;
- goto msi_mem_map_alloc_failure;
+ mem_map_base = MEM_MAP_INTR_REGIONS_BASE +
+ mem_map * MEM_MAP_INTR_REGION_SIZE;
+ mem_map_limit = mem_map_base + MEM_MAP_INTR_REGION_SIZE - 1;
+
+ msi_addr = mem_map_base + TRIO_MAP_MEM_REG_INT3 -
+ TRIO_MAP_MEM_REG_INT0;
+
+ msg.data = mem_map;
}

/* We try to distribute different IRQs to different tiles. */
cpu = tile_irq_cpu(irq);

/*
- * Now call up to the HV to configure the Mem-Map interrupt and
+ * Now call up to the HV to configure the MSI interrupt and
* set up the IPI binding.
*/
- mem_map_base = MEM_MAP_INTR_REGIONS_BASE +
- mem_map * MEM_MAP_INTR_REGION_SIZE;
- mem_map_limit = mem_map_base + MEM_MAP_INTR_REGION_SIZE - 1;
-
ret = gxio_trio_config_msi_intr(trio_context, cpu_x(cpu), cpu_y(cpu),
KERNEL_PL, irq, controller->mac,
mem_map, mem_map_base, mem_map_limit,
@@ -1512,13 +1535,9 @@ int arch_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)

irq_set_msi_desc(irq, desc);

- msi_addr = mem_map_base + TRIO_MAP_MEM_REG_INT3 - TRIO_MAP_MEM_REG_INT0;
-
msg.address_hi = msi_addr >> 32;
msg.address_lo = msi_addr & 0xffffffff;

- msg.data = mem_map;
-
write_msi_msg(irq, &msg);
irq_set_chip_and_handler(irq, &tilegx_msi_chip, handle_level_irq);
irq_set_handler_data(irq, controller);
--
1.8.3.1

2013-08-05 20:48:56

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH 05/20] tile PCI RC: handle case that PCI link is already up

If we are rebooting (e.g. via kexec), then the PCI RC link may
already be up. In that case, we don't want to do the software
fixup to force the link up, since that can degrade it to Gen1.
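
The essence of the change, condensed from the diff below:

        /* Only force the link up if the data-link layer is not
         * already up (e.g. across a kexec reboot). */
        port_status.word = __gxio_mmio_read(trio_context->mmio_base_mac +
                                            reg_offset);
        if (!port_status.dl_up)
                ret = gxio_trio_force_rc_link_up(trio_context, mac);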

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/pci_gx.c | 41 ++++++++++++++++++++++++++++-------------
1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index e99809e..2cc3e64 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -674,10 +674,33 @@ int __init pcibios_init(void)
continue;
}

- ret = gxio_trio_force_rc_link_up(trio_context, mac);
- if (ret < 0)
- pr_err("PCI: PCIE_FORCE_LINK_UP failure, "
- "MAC %d on TRIO %d\n", mac, trio_index);
+ /*
+ * Check for PCIe link-up status to decide if we need
+ * to force the link to come up.
+ */
+ reg_offset =
+ (TRIO_PCIE_INTFC_PORT_STATUS <<
+ TRIO_CFG_REGION_ADDR__REG_SHIFT) |
+ (TRIO_CFG_REGION_ADDR__INTFC_VAL_MAC_INTERFACE <<
+ TRIO_CFG_REGION_ADDR__INTFC_SHIFT) |
+ (mac << TRIO_CFG_REGION_ADDR__MAC_SEL_SHIFT);
+
+ port_status.word =
+ __gxio_mmio_read(trio_context->mmio_base_mac +
+ reg_offset);
+ if (!port_status.dl_up) {
+ if (rc_delay[trio_index][mac]) {
+ pr_info("Delaying PCIe RC TRIO init %d sec"
+ " on MAC %d on TRIO %d\n",
+ rc_delay[trio_index][mac], mac,
+ trio_index);
+ msleep(rc_delay[trio_index][mac] * 1000);
+ }
+ ret = gxio_trio_force_rc_link_up(trio_context, mac);
+ if (ret < 0)
+ pr_err("PCI: PCIE_FORCE_LINK_UP failure, "
+ "MAC %d on TRIO %d\n", mac, trio_index);
+ }

pr_info("PCI: Found PCI controller #%d on TRIO %d MAC %d\n", i,
trio_index, controller->mac);
@@ -700,16 +723,8 @@ int __init pcibios_init(void)
}

/*
- * Check for PCIe link-up status.
+ * Check for PCIe link-up status again.
*/
-
- reg_offset =
- (TRIO_PCIE_INTFC_PORT_STATUS <<
- TRIO_CFG_REGION_ADDR__REG_SHIFT) |
- (TRIO_CFG_REGION_ADDR__INTFC_VAL_MAC_INTERFACE <<
- TRIO_CFG_REGION_ADDR__INTFC_SHIFT ) |
- (mac << TRIO_CFG_REGION_ADDR__MAC_SEL_SHIFT);
-
port_status.word =
__gxio_mmio_read(trio_context->mmio_base_mac +
reg_offset);
--
1.8.3.1

2013-08-05 20:49:17

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH 03/20] tile PCI RC: support pci=off boot arg for tilepro

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/kernel/pci.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/arch/tile/kernel/pci.c b/arch/tile/kernel/pci.c
index 1dae3b2..af75835 100644
--- a/arch/tile/kernel/pci.c
+++ b/arch/tile/kernel/pci.c
@@ -51,6 +51,8 @@
*
*/

+static int pci_probe = 1;
+
/*
* This flag tells if the platform is TILEmpower that needs
* special configuration for the PLX switch chip.
@@ -143,6 +145,11 @@ int __init tile_pci_init(void)
{
int i;

+ if (!pci_probe) {
+ pr_info("PCI: disabled by boot argument\n");
+ return 0;
+ }
+
pr_info("PCI: Searching for controllers...\n");

/* Re-init number of PCIe controllers to support hot-plug feature. */
@@ -378,6 +385,16 @@ void pcibios_set_master(struct pci_dev *dev)
/* No special bus mastering setup handling. */
}

+/* Process any "pci=" kernel boot arguments. */
+char * __init pcibios_setup(char *str)
+{
+ if (!strcmp(str, "off")) {
+ pci_probe = 0;
+ return NULL;
+ }
+ return str;
+}
+
/*
* Enable memory and/or address decoding, as appropriate, for the
* device described by the 'dev' struct.
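
(Usage note: with this change, booting with "pci=off" on the kernel
command line makes tile_pci_init() return early with "PCI: disabled by
boot argument"; the generic PCI layer hands each "pci=" option string
to pcibios_setup(), and returning NULL marks the option as consumed.)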
--
1.8.3.1

2013-08-05 20:53:14

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 06/20] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

On Mon, Aug 05, 2013 at 04:06:20PM -0400, Chris Metcalf wrote:
> The LSI MEGARAID SAS HBA suffers from the problem where it can do
> 64-bit DMA to streaming buffers but not to consistent buffers.
> In other words, 64-bit DMA is used for disk data transfers and 32-bit
> DMA must be used for control message transfers. According to LSI,
> the firmware is not fully functional yet. This change implements a
> kind of hybrid dma_ops to support this.

If this is generic to LSI MegaRAID HBA shouldn't this change also be
done for the other platforms?

Or perhaps some other changes to make it work with other devices?

>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/tile/include/asm/dma-mapping.h | 4 ++--
> arch/tile/kernel/pci-dma.c | 17 +++++++++++------
> 2 files changed, 13 insertions(+), 8 deletions(-)
>
> diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
> index f2ff191..6da0540 100644
> --- a/arch/tile/include/asm/dma-mapping.h
> +++ b/arch/tile/include/asm/dma-mapping.h
> @@ -44,12 +44,12 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>
> static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
> {
> - return paddr + get_dma_offset(dev);
> + return paddr;
> }
>
> static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
> {
> - return daddr - get_dma_offset(dev);
> + return daddr;
> }
>
> static inline void dma_mark_clean(void *addr, size_t size) {}
> diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
> index b9fe80e..7e98371 100644
> --- a/arch/tile/kernel/pci-dma.c
> +++ b/arch/tile/kernel/pci-dma.c
> @@ -357,7 +357,7 @@ static void *tile_pci_dma_alloc_coherent(struct device *dev, size_t size,
>
> addr = page_to_phys(pg);
>
> - *dma_handle = phys_to_dma(dev, addr);
> + *dma_handle = addr + get_dma_offset(dev);
>
> return page_address(pg);
> }
> @@ -387,7 +387,7 @@ static int tile_pci_dma_map_sg(struct device *dev, struct scatterlist *sglist,
> sg->dma_address = sg_phys(sg);
> __dma_prep_pa_range(sg->dma_address, sg->length, direction);
>
> - sg->dma_address = phys_to_dma(dev, sg->dma_address);
> + sg->dma_address = sg->dma_address + get_dma_offset(dev);
> #ifdef CONFIG_NEED_SG_DMA_LENGTH
> sg->dma_length = sg->length;
> #endif
> @@ -422,7 +422,7 @@ static dma_addr_t tile_pci_dma_map_page(struct device *dev, struct page *page,
> BUG_ON(offset + size > PAGE_SIZE);
> __dma_prep_page(page, offset, size, direction);
>
> - return phys_to_dma(dev, page_to_pa(page) + offset);
> + return page_to_pa(page) + offset + get_dma_offset(dev);
> }
>
> static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
> @@ -432,7 +432,7 @@ static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
> {
> BUG_ON(!valid_dma_direction(direction));
>
> - dma_address = dma_to_phys(dev, dma_address);
> + dma_address -= get_dma_offset(dev);
>
> __dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
> dma_address & PAGE_OFFSET, size, direction);
> @@ -445,7 +445,7 @@ static void tile_pci_dma_sync_single_for_cpu(struct device *dev,
> {
> BUG_ON(!valid_dma_direction(direction));
>
> - dma_handle = dma_to_phys(dev, dma_handle);
> + dma_handle -= get_dma_offset(dev);
>
> __dma_complete_pa_range(dma_handle, size, direction);
> }
> @@ -456,7 +456,7 @@ static void tile_pci_dma_sync_single_for_device(struct device *dev,
> enum dma_data_direction
> direction)
> {
> - dma_handle = dma_to_phys(dev, dma_handle);
> + dma_handle -= get_dma_offset(dev);
>
> __dma_prep_pa_range(dma_handle, size, direction);
> }
> @@ -573,6 +573,11 @@ int dma_set_coherent_mask(struct device *dev, u64 mask)
> if (((dma_ops == gx_pci_dma_map_ops) ||
> (dma_ops == gx_legacy_pci_dma_map_ops)) &&
> (mask <= DMA_BIT_MASK(32))) {
> + if (dma_ops == gx_pci_dma_map_ops) {
> + dma_ops->alloc = tile_swiotlb_alloc_coherent;
> + dma_ops->free = tile_swiotlb_free_coherent;
> + }
> +

So.. that would change it for all of the devices on the host, not just
for this specific one. That is not good is it?

> if (mask > dev->archdata.max_direct_dma_addr)
> mask = dev->archdata.max_direct_dma_addr;
> }
> --
> 1.8.3.1
>

2013-08-06 17:00:10

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 06/20] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

On 8/5/2013 4:52 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Aug 05, 2013 at 04:06:20PM -0400, Chris Metcalf wrote:
>> The LSI MEGARAID SAS HBA suffers from the problem where it can do
>> 64-bit DMA to streaming buffers but not to consistent buffers.
>> In other words, 64-bit DMA is used for disk data transfers and 32-bit
>> DMA must be used for control message transfers. According to LSI,
>> the firmware is not fully functional yet. This change implements a
>> kind of hybrid dma_ops to support this.
> If this is generic to LSI MegaRAID HBA shouldn't this change also be
> done for the other platforms?
>
> Or perhaps some other changes to make it work with other devices?

No, it's really specific to how tilegx maps the 64-bit PCI space, 32-bit PCI space, and memory. I'll add more commentary to the git commit message to make this clearer.

>> @@ -573,6 +573,11 @@ int dma_set_coherent_mask(struct device *dev, u64 mask)
>> if (((dma_ops == gx_pci_dma_map_ops) ||
>> (dma_ops == gx_legacy_pci_dma_map_ops)) &&
>> (mask <= DMA_BIT_MASK(32))) {
>> + if (dma_ops == gx_pci_dma_map_ops) {
>> + dma_ops->alloc = tile_swiotlb_alloc_coherent;
>> + dma_ops->free = tile_swiotlb_free_coherent;
>> + }
>> +
> So.. that would change it for all of the devices on the host, not just
> for this specific one. That is not good is it?

Yes, good point; doing it this way might have some performance impact on other devices on the same host. See the following email for a revised change that updates the dma_ops pointer for that single device instead.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2013-08-06 17:02:27

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

The LSI MEGARAID SAS HBA suffers from the problem where it can do
64-bit DMA to streaming buffers but not to consistent buffers.
In other words, 64-bit DMA is used for disk data transfers and 32-bit
DMA must be used for control message transfers. According to LSI,
the firmware is not fully functional yet. This change implements a
kind of hybrid dma_ops to support this.

Note that on most other platforms, the 64-bit DMA addressing space is the
same as the 32-bit DMA space and they overlap the physical memory space.
No special arrangement is needed to support this kind of mixed DMA
capability. On TILE-Gx, the 64-bit DMA space is completely separate
from the 32-bit DMA space. Due to the use of the IOMMU, the 64-bit DMA
space doesn't overlap the physical memory space. On the other hand,
the 32-bit DMA space overlaps the physical memory space under 4GB.
The separate address spaces make it necessary to have separate dma_ops.
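
As a sketch of the driver-visible behavior (standard DMA API calls;
pdev stands for the device in question), a device that streams with
64-bit DMA but needs its consistent allocations below 4GB ends up on
the hybrid ops:

        dma_set_mask(&pdev->dev, DMA_BIT_MASK(64));
        /* Switches a gx_pci_dma_map_ops device to the hybrid ops,
         * whose alloc/free go through swiotlb bounce buffers. */
        dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(32));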

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/dma-mapping.h | 10 +++++++---
arch/tile/kernel/pci-dma.c | 40 ++++++++++++++++++++++++++++---------
2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
index f2ff191..4a60059 100644
--- a/arch/tile/include/asm/dma-mapping.h
+++ b/arch/tile/include/asm/dma-mapping.h
@@ -23,6 +23,7 @@
extern struct dma_map_ops *tile_dma_map_ops;
extern struct dma_map_ops *gx_pci_dma_map_ops;
extern struct dma_map_ops *gx_legacy_pci_dma_map_ops;
+extern struct dma_map_ops *gx_hybrid_pci_dma_map_ops;

static inline struct dma_map_ops *get_dma_ops(struct device *dev)
{
@@ -44,12 +45,12 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)

static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
{
- return paddr + get_dma_offset(dev);
+ return paddr;
}

static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
{
- return daddr - get_dma_offset(dev);
+ return daddr;
}

static inline void dma_mark_clean(void *addr, size_t size) {}
@@ -88,7 +89,10 @@ dma_set_mask(struct device *dev, u64 mask)
struct dma_map_ops *dma_ops = get_dma_ops(dev);

/* Handle legacy PCI devices with limited memory addressability. */
- if ((dma_ops == gx_pci_dma_map_ops) && (mask <= DMA_BIT_MASK(32))) {
+ if ((dma_ops == gx_pci_dma_map_ops ||
+ dma_ops == gx_hybrid_pci_dma_map_ops ||
+ dma_ops == gx_legacy_pci_dma_map_ops) &&
+ (mask <= DMA_BIT_MASK(32))) {
set_dma_ops(dev, gx_legacy_pci_dma_map_ops);
set_dma_offset(dev, 0);
if (mask > dev->archdata.max_direct_dma_addr)
diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
index b9fe80e..7e22e73 100644
--- a/arch/tile/kernel/pci-dma.c
+++ b/arch/tile/kernel/pci-dma.c
@@ -357,7 +357,7 @@ static void *tile_pci_dma_alloc_coherent(struct device *dev, size_t size,

addr = page_to_phys(pg);

- *dma_handle = phys_to_dma(dev, addr);
+ *dma_handle = addr + get_dma_offset(dev);

return page_address(pg);
}
@@ -387,7 +387,7 @@ static int tile_pci_dma_map_sg(struct device *dev, struct scatterlist *sglist,
sg->dma_address = sg_phys(sg);
__dma_prep_pa_range(sg->dma_address, sg->length, direction);

- sg->dma_address = phys_to_dma(dev, sg->dma_address);
+ sg->dma_address = sg->dma_address + get_dma_offset(dev);
#ifdef CONFIG_NEED_SG_DMA_LENGTH
sg->dma_length = sg->length;
#endif
@@ -422,7 +422,7 @@ static dma_addr_t tile_pci_dma_map_page(struct device *dev, struct page *page,
BUG_ON(offset + size > PAGE_SIZE);
__dma_prep_page(page, offset, size, direction);

- return phys_to_dma(dev, page_to_pa(page) + offset);
+ return page_to_pa(page) + offset + get_dma_offset(dev);
}

static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
@@ -432,7 +432,7 @@ static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
{
BUG_ON(!valid_dma_direction(direction));

- dma_address = dma_to_phys(dev, dma_address);
+ dma_address -= get_dma_offset(dev);

__dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
dma_address & PAGE_OFFSET, size, direction);
@@ -445,7 +445,7 @@ static void tile_pci_dma_sync_single_for_cpu(struct device *dev,
{
BUG_ON(!valid_dma_direction(direction));

- dma_handle = dma_to_phys(dev, dma_handle);
+ dma_handle -= get_dma_offset(dev);

__dma_complete_pa_range(dma_handle, size, direction);
}
@@ -456,7 +456,7 @@ static void tile_pci_dma_sync_single_for_device(struct device *dev,
enum dma_data_direction
direction)
{
- dma_handle = dma_to_phys(dev, dma_handle);
+ dma_handle -= get_dma_offset(dev);

__dma_prep_pa_range(dma_handle, size, direction);
}
@@ -558,21 +558,43 @@ static struct dma_map_ops pci_swiotlb_dma_ops = {
.mapping_error = swiotlb_dma_mapping_error,
};

+static struct dma_map_ops pci_hybrid_dma_ops = {
+ .alloc = tile_swiotlb_alloc_coherent,
+ .free = tile_swiotlb_free_coherent,
+ .map_page = tile_pci_dma_map_page,
+ .unmap_page = tile_pci_dma_unmap_page,
+ .map_sg = tile_pci_dma_map_sg,
+ .unmap_sg = tile_pci_dma_unmap_sg,
+ .sync_single_for_cpu = tile_pci_dma_sync_single_for_cpu,
+ .sync_single_for_device = tile_pci_dma_sync_single_for_device,
+ .sync_sg_for_cpu = tile_pci_dma_sync_sg_for_cpu,
+ .sync_sg_for_device = tile_pci_dma_sync_sg_for_device,
+ .mapping_error = tile_pci_dma_mapping_error,
+ .dma_supported = tile_pci_dma_supported
+};
+
struct dma_map_ops *gx_legacy_pci_dma_map_ops = &pci_swiotlb_dma_ops;
+struct dma_map_ops *gx_hybrid_pci_dma_map_ops = &pci_hybrid_dma_ops;
#else
struct dma_map_ops *gx_legacy_pci_dma_map_ops;
+struct dma_map_ops *gx_hybrid_pci_dma_map_ops;
#endif
EXPORT_SYMBOL(gx_legacy_pci_dma_map_ops);
+EXPORT_SYMBOL(gx_hybrid_pci_dma_map_ops);

#ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK
int dma_set_coherent_mask(struct device *dev, u64 mask)
{
struct dma_map_ops *dma_ops = get_dma_ops(dev);

- /* Handle legacy PCI devices with limited memory addressability. */
- if (((dma_ops == gx_pci_dma_map_ops) ||
- (dma_ops == gx_legacy_pci_dma_map_ops)) &&
+ /* Handle hybrid PCI devices with limited memory addressability. */
+ if ((dma_ops == gx_pci_dma_map_ops ||
+ dma_ops == gx_hybrid_pci_dma_map_ops ||
+ dma_ops == gx_legacy_pci_dma_map_ops) &&
(mask <= DMA_BIT_MASK(32))) {
+ if (dma_ops == gx_pci_dma_map_ops)
+ set_dma_ops(dev, gx_hybrid_pci_dma_map_ops);
+
if (mask > dev->archdata.max_direct_dma_addr)
mask = dev->archdata.max_direct_dma_addr;
}
--
1.8.3.1

2013-08-06 17:48:28

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

[+cc Myron, Adam]

On Fri, Aug 2, 2013 at 10:24 AM, Chris Metcalf <[email protected]> wrote:
> The LSI MEGARAID SAS HBA suffers from the problem where it can do
> 64-bit DMA to streaming buffers but not to consistent buffers.
> In other words, 64-bit DMA is used for disk data transfers and 32-bit
> DMA must be used for control message transfers.

Is this related at all to the make_local_pdev() hacks in megaraid.c?
The intent there seems to be to use a 32-bit DMA mask for certain
transactions and a 64-bit mask for others. But I think it's actually
broken, because the implementation changes the mask to 32-bit for
*all* future transactions, not just the one using make_local_pdev().

> According to LSI,
> the firmware is not fully functional yet. This change implements a
> kind of hybrid dma_ops to support this.
>
> Note that on most other platforms, the 64-bit DMA addressing space is the
> same as the 32-bit DMA space and they overlap the physical memory space.
> No special arrangement is needed to support this kind of mixed DMA
> capability. On TILE-Gx, the 64-bit DMA space is completely separate
> from the 32-bit DMA space.

Help me understand what's going on here. My understanding is that on
typical systems, the 32-bit DMA space is a subset of the 64-bit DMA
space. In conventional PCI, "a master that supports 64-bit addressing
must generate a SAC, instead of a DAC, when the upper 32 bits of the
address are zero" (PCI spec r3.0, sec 3.9). PCIe doesn't have
SAC/DAC, but it has both 32-bit and 64-bit address headers and has a
similar requirement: "For Addresses below 4GB, Requesters must use the
32-bit format" (PCIe spec r3.0, sec 2.2.4.1).

Those imply to me that the 0-4GB region of the 64-bit DMA space must
be identical to the 0-4GB 32-bit DMA space, and in fact, the receiver
of a transaction shouldn't be able to distinguish them.

But it sounds like something's different on TILE-Gx? Does it
translate bus addresses to physical memory addresses based on the type
of the transaction (SAC vs DAC, or 32-bit vs 64-bit header) in
addition to the address? Even if it does, the spec doesn't allow a
DAC cycle or a 64-bit header where the 32 high-order bits are zero, so
it shouldn't matter.

Bjorn

> Due to the use of the IOMMU, the 64-bit DMA
> space doesn't overlap the physical memory space. On the other hand,
> the 32-bit DMA space overlaps the physical memory space under 4GB.
> The separate address spaces make it necessary to have separate dma_ops.
>
> Signed-off-by: Chris Metcalf <[email protected]>
> ---
> arch/tile/include/asm/dma-mapping.h | 10 +++++++---
> arch/tile/kernel/pci-dma.c | 40 ++++++++++++++++++++++++++++---------
> 2 files changed, 38 insertions(+), 12 deletions(-)
>
> diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
> index f2ff191..4a60059 100644
> --- a/arch/tile/include/asm/dma-mapping.h
> +++ b/arch/tile/include/asm/dma-mapping.h
> @@ -23,6 +23,7 @@
> extern struct dma_map_ops *tile_dma_map_ops;
> extern struct dma_map_ops *gx_pci_dma_map_ops;
> extern struct dma_map_ops *gx_legacy_pci_dma_map_ops;
> +extern struct dma_map_ops *gx_hybrid_pci_dma_map_ops;
>
> static inline struct dma_map_ops *get_dma_ops(struct device *dev)
> {
> @@ -44,12 +45,12 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>
> static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
> {
> - return paddr + get_dma_offset(dev);
> + return paddr;
> }
>
> static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
> {
> - return daddr - get_dma_offset(dev);
> + return daddr;
> }
>
> static inline void dma_mark_clean(void *addr, size_t size) {}
> @@ -88,7 +89,10 @@ dma_set_mask(struct device *dev, u64 mask)
> struct dma_map_ops *dma_ops = get_dma_ops(dev);
>
> /* Handle legacy PCI devices with limited memory addressability. */
> - if ((dma_ops == gx_pci_dma_map_ops) && (mask <= DMA_BIT_MASK(32))) {
> + if ((dma_ops == gx_pci_dma_map_ops ||
> + dma_ops == gx_hybrid_pci_dma_map_ops ||
> + dma_ops == gx_legacy_pci_dma_map_ops) &&
> + (mask <= DMA_BIT_MASK(32))) {
> set_dma_ops(dev, gx_legacy_pci_dma_map_ops);
> set_dma_offset(dev, 0);
> if (mask > dev->archdata.max_direct_dma_addr)
> diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
> index b9fe80e..7e22e73 100644
> --- a/arch/tile/kernel/pci-dma.c
> +++ b/arch/tile/kernel/pci-dma.c
> @@ -357,7 +357,7 @@ static void *tile_pci_dma_alloc_coherent(struct device *dev, size_t size,
>
> addr = page_to_phys(pg);
>
> - *dma_handle = phys_to_dma(dev, addr);
> + *dma_handle = addr + get_dma_offset(dev);
>
> return page_address(pg);
> }
> @@ -387,7 +387,7 @@ static int tile_pci_dma_map_sg(struct device *dev, struct scatterlist *sglist,
> sg->dma_address = sg_phys(sg);
> __dma_prep_pa_range(sg->dma_address, sg->length, direction);
>
> - sg->dma_address = phys_to_dma(dev, sg->dma_address);
> + sg->dma_address = sg->dma_address + get_dma_offset(dev);
> #ifdef CONFIG_NEED_SG_DMA_LENGTH
> sg->dma_length = sg->length;
> #endif
> @@ -422,7 +422,7 @@ static dma_addr_t tile_pci_dma_map_page(struct device *dev, struct page *page,
> BUG_ON(offset + size > PAGE_SIZE);
> __dma_prep_page(page, offset, size, direction);
>
> - return phys_to_dma(dev, page_to_pa(page) + offset);
> + return page_to_pa(page) + offset + get_dma_offset(dev);
> }
>
> static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
> @@ -432,7 +432,7 @@ static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
> {
> BUG_ON(!valid_dma_direction(direction));
>
> - dma_address = dma_to_phys(dev, dma_address);
> + dma_address -= get_dma_offset(dev);
>
> __dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
> dma_address & PAGE_OFFSET, size, direction);
> @@ -445,7 +445,7 @@ static void tile_pci_dma_sync_single_for_cpu(struct device *dev,
> {
> BUG_ON(!valid_dma_direction(direction));
>
> - dma_handle = dma_to_phys(dev, dma_handle);
> + dma_handle -= get_dma_offset(dev);
>
> __dma_complete_pa_range(dma_handle, size, direction);
> }
> @@ -456,7 +456,7 @@ static void tile_pci_dma_sync_single_for_device(struct device *dev,
> enum dma_data_direction
> direction)
> {
> - dma_handle = dma_to_phys(dev, dma_handle);
> + dma_handle -= get_dma_offset(dev);
>
> __dma_prep_pa_range(dma_handle, size, direction);
> }
> @@ -558,21 +558,43 @@ static struct dma_map_ops pci_swiotlb_dma_ops = {
> .mapping_error = swiotlb_dma_mapping_error,
> };
>
> +static struct dma_map_ops pci_hybrid_dma_ops = {
> + .alloc = tile_swiotlb_alloc_coherent,
> + .free = tile_swiotlb_free_coherent,
> + .map_page = tile_pci_dma_map_page,
> + .unmap_page = tile_pci_dma_unmap_page,
> + .map_sg = tile_pci_dma_map_sg,
> + .unmap_sg = tile_pci_dma_unmap_sg,
> + .sync_single_for_cpu = tile_pci_dma_sync_single_for_cpu,
> + .sync_single_for_device = tile_pci_dma_sync_single_for_device,
> + .sync_sg_for_cpu = tile_pci_dma_sync_sg_for_cpu,
> + .sync_sg_for_device = tile_pci_dma_sync_sg_for_device,
> + .mapping_error = tile_pci_dma_mapping_error,
> + .dma_supported = tile_pci_dma_supported
> +};
> +
> struct dma_map_ops *gx_legacy_pci_dma_map_ops = &pci_swiotlb_dma_ops;
> +struct dma_map_ops *gx_hybrid_pci_dma_map_ops = &pci_hybrid_dma_ops;
> #else
> struct dma_map_ops *gx_legacy_pci_dma_map_ops;
> +struct dma_map_ops *gx_hybrid_pci_dma_map_ops;
> #endif
> EXPORT_SYMBOL(gx_legacy_pci_dma_map_ops);
> +EXPORT_SYMBOL(gx_hybrid_pci_dma_map_ops);
>
> #ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK
> int dma_set_coherent_mask(struct device *dev, u64 mask)
> {
> struct dma_map_ops *dma_ops = get_dma_ops(dev);
>
> - /* Handle legacy PCI devices with limited memory addressability. */
> - if (((dma_ops == gx_pci_dma_map_ops) ||
> - (dma_ops == gx_legacy_pci_dma_map_ops)) &&
> + /* Handle hybrid PCI devices with limited memory addressability. */
> + if ((dma_ops == gx_pci_dma_map_ops ||
> + dma_ops == gx_hybrid_pci_dma_map_ops ||
> + dma_ops == gx_legacy_pci_dma_map_ops) &&
> (mask <= DMA_BIT_MASK(32))) {
> + if (dma_ops == gx_pci_dma_map_ops)
> + set_dma_ops(dev, gx_hybrid_pci_dma_map_ops);
> +
> if (mask > dev->archdata.max_direct_dma_addr)
> mask = dev->archdata.max_direct_dma_addr;
> }
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-08-09 22:42:34

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

On Thu, Aug 08, 2013 at 12:47:10PM -0400, Chris Metcalf wrote:
> On 8/6/2013 1:48 PM, Bjorn Helgaas wrote:
> > [+cc Myron, Adam]
> >
> > On Fri, Aug 2, 2013 at 10:24 AM, Chris Metcalf <[email protected]> wrote:
> >> According to LSI,
> >> the firmware is not fully functional yet. This change implements a
> >> kind of hybrid dma_ops to support this.
> >>
> >> Note that on most other platforms, the 64-bit DMA addressing space is the
> >> same as the 32-bit DMA space and they overlap the physical memory space.
> >> No special arrangement is needed to support this kind of mixed DMA
> >> capability. On TILE-Gx, the 64-bit DMA space is completely separate
> >> from the 32-bit DMA space.
> > Help me understand what's going on here. My understanding is that on
> > typical systems, the 32-bit DMA space is a subset of the 64-bit DMA
> > space. In conventional PCI, "a master that supports 64-bit addressing
> > must generate a SAC, instead of a DAC, when the upper 32 bits of the
> > address are zero" (PCI spec r3.0, sec 3.9). PCIe doesn't have
> > SAC/DAC, but it has both 32-bit and 64-bit address headers and has a
> > similar requirement: "For Addresses below 4GB, Requesters must use the
> > 32-bit format" (PCIe spec r3.0, sec 2.2.4.1).
> >
> > Those imply to me that the 0-4GB region of the 64-bit DMA space must
> > be identical to the 0-4GB 32-bit DMA space, and in fact, the receiver
> > of a transaction shouldn't be able to distinguish them.
> >
> > But it sounds like something's different on TILE-Gx? Does it
> > translate bus addresses to physical memory addresses based on the type
> > of the transaction (SAC vs DAC, or 32-bit vs 64-bit header) in
> > addition to the address? Even if it does, the spec doesn't allow a
> > DAC cycle or a 64-bit header where the 32 high-order bits are zero, so
> > it shouldn't matter.
>
> No, we don't translate based on the type of the transaction. Using
> "DMA space" in the commit message was probably misleading. What's
> really going on is different DMA windows. 32-bit DMA has the
> obvious naive implementation where [0,4GB] in DMA space maps to
> [0,4GB] in PA space. However, for 64-bit DMA, we use DMA
> addresses with a non-zero high 32 bits, in the [1TB,2TB] range,
> but map the results down to PA [0,1TB] using our IOMMU.

I guess this means devices can DMA to physical addresses [0,3GB]
using either 32-bit bus addresses in the [0,3GB] range or 64-bit bus
addresses in the [1TB,1TB+3GB] range, right?

> We did consider having the 64-bit DMA window be [0,1TB] and map
> directly to PA space, like the 32-bit window. But this design
> suffers from the “PCI hole” problem. Basically, the BAR space is
> usually under 4GB (it occupies the range [3GB, 4GB] on tilegx) and
> the host bridge uses negative decoding in passing DMA traffic
> upstream. That is, DMA traffic with a target address in [3GB, 4GB]
> is not passed to the host memory. This means that the same amount
> of physical memory as the BAR space cannot be used for DMA
> purposes. And because it is not easy to avoid this region in
> allocating DMA memory, the kernel is simply told to not use this
> chunk of PA at all, so it is wasted.

OK, so physical memory in the [3GB,4GB] range is unreachable via DMA
as you describe. And even if DMA *could* reach it, the CPU couldn't
see it because CPU accesses to that range would go to PCI for the
memory-mapped BAR space, not to memory.

But I can't figure out why Tile needs to do something special. I
think other arches handle the PCI hole for MMIO space the same way.

I don't know if other arches alias the [0,3GB] physical address
range in both 32-bit and 64-bit DMA space like you do, but if that's
part of the problem, it seems like you could easily avoid the
aliasing by making the 64-bit DMA space [1TB+4GB,2TB] instead of
[1TB,2TB].

> For the LSI device, the way we manage it is to ensure that the
> device’s streaming buffers and the consistent buffers come from
> different pools, with the latter using the under-4GB bounce
> buffers. Obviously, normal devices use the same buffer pool for
> both streaming and consistent, either under 4GB or the whole PA.

It seems like you could make your DMA space be the union of [0,3GB]
and [1TB+4GB,2TB], then use pci_set_dma_mask(dev, DMA_BIT_MASK(64))
and pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)) (I assume the
driver already sets those masks correctly if it works on other
arches).
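
For concreteness, that mask setup with the standard PCI DMA API of
this era (pdev being the device in question) would be:

        pci_set_dma_mask(pdev, DMA_BIT_MASK(64));            /* streaming */
        pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32)); /* consistent */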

> Given all of that, does this change make sense? I can certainly
> amend the commit description to include more commentary.

Obviously, I'm missing something. I guess it really doesn't matter
because this is all arch code and I don't need to understand it, but
it does niggle at me somehow.

Bjorn

> >> Due to the use of the IOMMU, the 64-bit DMA
> >> space doesn't overlap the physical memory space. On the other hand,
> >> the 32-bit DMA space overlaps the physical memory space under 4GB.
> >> The separate address spaces make it necessary to have separate dma_ops.
> >>
> >> Signed-off-by: Chris Metcalf <[email protected]>
> >> ---
> >> arch/tile/include/asm/dma-mapping.h | 10 +++++++---
> >> arch/tile/kernel/pci-dma.c | 40 ++++++++++++++++++++++++++++---------
> >> 2 files changed, 38 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
> >> index f2ff191..4a60059 100644
> >> --- a/arch/tile/include/asm/dma-mapping.h
> >> +++ b/arch/tile/include/asm/dma-mapping.h
> >> @@ -23,6 +23,7 @@
> >> extern struct dma_map_ops *tile_dma_map_ops;
> >> extern struct dma_map_ops *gx_pci_dma_map_ops;
> >> extern struct dma_map_ops *gx_legacy_pci_dma_map_ops;
> >> +extern struct dma_map_ops *gx_hybrid_pci_dma_map_ops;
> >>
> >> static inline struct dma_map_ops *get_dma_ops(struct device *dev)
> >> {
> >> @@ -44,12 +45,12 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
> >>
> >> static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
> >> {
> >> - return paddr + get_dma_offset(dev);
> >> + return paddr;
> >> }
> >>
> >> static inline phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
> >> {
> >> - return daddr - get_dma_offset(dev);
> >> + return daddr;
> >> }
> >>
> >> static inline void dma_mark_clean(void *addr, size_t size) {}
> >> @@ -88,7 +89,10 @@ dma_set_mask(struct device *dev, u64 mask)
> >> struct dma_map_ops *dma_ops = get_dma_ops(dev);
> >>
> >> /* Handle legacy PCI devices with limited memory addressability. */
> >> - if ((dma_ops == gx_pci_dma_map_ops) && (mask <= DMA_BIT_MASK(32))) {
> >> + if ((dma_ops == gx_pci_dma_map_ops ||
> >> + dma_ops == gx_hybrid_pci_dma_map_ops ||
> >> + dma_ops == gx_legacy_pci_dma_map_ops) &&
> >> + (mask <= DMA_BIT_MASK(32))) {
> >> set_dma_ops(dev, gx_legacy_pci_dma_map_ops);
> >> set_dma_offset(dev, 0);
> >> if (mask > dev->archdata.max_direct_dma_addr)
> >> diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
> >> index b9fe80e..7e22e73 100644
> >> --- a/arch/tile/kernel/pci-dma.c
> >> +++ b/arch/tile/kernel/pci-dma.c
> >> @@ -357,7 +357,7 @@ static void *tile_pci_dma_alloc_coherent(struct device *dev, size_t size,
> >>
> >> addr = page_to_phys(pg);
> >>
> >> - *dma_handle = phys_to_dma(dev, addr);
> >> + *dma_handle = addr + get_dma_offset(dev);
> >>
> >> return page_address(pg);
> >> }
> >> @@ -387,7 +387,7 @@ static int tile_pci_dma_map_sg(struct device *dev, struct scatterlist *sglist,
> >> sg->dma_address = sg_phys(sg);
> >> __dma_prep_pa_range(sg->dma_address, sg->length, direction);
> >>
> >> - sg->dma_address = phys_to_dma(dev, sg->dma_address);
> >> + sg->dma_address = sg->dma_address + get_dma_offset(dev);
> >> #ifdef CONFIG_NEED_SG_DMA_LENGTH
> >> sg->dma_length = sg->length;
> >> #endif
> >> @@ -422,7 +422,7 @@ static dma_addr_t tile_pci_dma_map_page(struct device *dev, struct page *page,
> >> BUG_ON(offset + size > PAGE_SIZE);
> >> __dma_prep_page(page, offset, size, direction);
> >>
> >> - return phys_to_dma(dev, page_to_pa(page) + offset);
> >> + return page_to_pa(page) + offset + get_dma_offset(dev);
> >> }
> >>
> >> static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
> >> @@ -432,7 +432,7 @@ static void tile_pci_dma_unmap_page(struct device *dev, dma_addr_t dma_address,
> >> {
> >> BUG_ON(!valid_dma_direction(direction));
> >>
> >> - dma_address = dma_to_phys(dev, dma_address);
> >> + dma_address -= get_dma_offset(dev);
> >>
> >> __dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
> >> dma_address & PAGE_OFFSET, size, direction);
> >> @@ -445,7 +445,7 @@ static void tile_pci_dma_sync_single_for_cpu(struct device *dev,
> >> {
> >> BUG_ON(!valid_dma_direction(direction));
> >>
> >> - dma_handle = dma_to_phys(dev, dma_handle);
> >> + dma_handle -= get_dma_offset(dev);
> >>
> >> __dma_complete_pa_range(dma_handle, size, direction);
> >> }
> >> @@ -456,7 +456,7 @@ static void tile_pci_dma_sync_single_for_device(struct device *dev,
> >> enum dma_data_direction
> >> direction)
> >> {
> >> - dma_handle = dma_to_phys(dev, dma_handle);
> >> + dma_handle -= get_dma_offset(dev);
> >>
> >> __dma_prep_pa_range(dma_handle, size, direction);
> >> }
> >> @@ -558,21 +558,43 @@ static struct dma_map_ops pci_swiotlb_dma_ops = {
> >> .mapping_error = swiotlb_dma_mapping_error,
> >> };
> >>
> >> +static struct dma_map_ops pci_hybrid_dma_ops = {
> >> + .alloc = tile_swiotlb_alloc_coherent,
> >> + .free = tile_swiotlb_free_coherent,
> >> + .map_page = tile_pci_dma_map_page,
> >> + .unmap_page = tile_pci_dma_unmap_page,
> >> + .map_sg = tile_pci_dma_map_sg,
> >> + .unmap_sg = tile_pci_dma_unmap_sg,
> >> + .sync_single_for_cpu = tile_pci_dma_sync_single_for_cpu,
> >> + .sync_single_for_device = tile_pci_dma_sync_single_for_device,
> >> + .sync_sg_for_cpu = tile_pci_dma_sync_sg_for_cpu,
> >> + .sync_sg_for_device = tile_pci_dma_sync_sg_for_device,
> >> + .mapping_error = tile_pci_dma_mapping_error,
> >> + .dma_supported = tile_pci_dma_supported
> >> +};
> >> +
> >> struct dma_map_ops *gx_legacy_pci_dma_map_ops = &pci_swiotlb_dma_ops;
> >> +struct dma_map_ops *gx_hybrid_pci_dma_map_ops = &pci_hybrid_dma_ops;
> >> #else
> >> struct dma_map_ops *gx_legacy_pci_dma_map_ops;
> >> +struct dma_map_ops *gx_hybrid_pci_dma_map_ops;
> >> #endif
> >> EXPORT_SYMBOL(gx_legacy_pci_dma_map_ops);
> >> +EXPORT_SYMBOL(gx_hybrid_pci_dma_map_ops);
> >>
> >> #ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK
> >> int dma_set_coherent_mask(struct device *dev, u64 mask)
> >> {
> >> struct dma_map_ops *dma_ops = get_dma_ops(dev);
> >>
> >> - /* Handle legacy PCI devices with limited memory addressability. */
> >> - if (((dma_ops == gx_pci_dma_map_ops) ||
> >> - (dma_ops == gx_legacy_pci_dma_map_ops)) &&
> >> + /* Handle hybrid PCI devices with limited memory addressability. */
> >> + if ((dma_ops == gx_pci_dma_map_ops ||
> >> + dma_ops == gx_hybrid_pci_dma_map_ops ||
> >> + dma_ops == gx_legacy_pci_dma_map_ops) &&
> >> (mask <= DMA_BIT_MASK(32))) {
> >> + if (dma_ops == gx_pci_dma_map_ops)
> >> + set_dma_ops(dev, gx_hybrid_pci_dma_map_ops);
> >> +
> >> if (mask > dev->archdata.max_direct_dma_addr)
> >> mask = dev->archdata.max_direct_dma_addr;
> >> }
> >> --
> >> 1.8.3.1
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>

2013-08-12 19:44:13

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

(Oops, resending without the helpful [SPAM] marker that our
mail system appears to have injected into the subject line.)

On 8/9/2013 6:42 PM, Bjorn Helgaas wrote:
> On Thu, Aug 08, 2013 at 12:47:10PM -0400, Chris Metcalf wrote:
>> On 8/6/2013 1:48 PM, Bjorn Helgaas wrote:
>>> [+cc Myron, Adam]
>>>
>>> On Fri, Aug 2, 2013 at 10:24 AM, Chris Metcalf <[email protected]> wrote:
>>>> According to LSI,
>>>> the firmware is not fully functional yet. This change implements a
>>>> kind of hybrid dma_ops to support this.
>>>>
>>>> Note that on most other platforms, the 64-bit DMA addressing space is the
>>>> same as the 32-bit DMA space and they overlap the physical memory space.
>>>> No special arrangement is needed to support this kind of mixed DMA
>>>> capability. On TILE-Gx, the 64-bit DMA space is completely separate
>>>> from the 32-bit DMA space.
>>> Help me understand what's going on here. My understanding is that on
>>> typical systems, the 32-bit DMA space is a subset of the 64-bit DMA
>>> space. In conventional PCI, "a master that supports 64-bit addressing
>>> must generate a SAC, instead of a DAC, when the upper 32 bits of the
>>> address are zero" (PCI spec r3.0, sec 3.9). PCIe doesn't have
>>> SAC/DAC, but it has both 32-bit and 64-bit address headers and has a
>>> similar requirement: "For Addresses below 4GB, Requesters must use the
>>> 32-bit format" (PCIe spec r3.0, sec 2.2.4.1).
>>>
>>> Those imply to me that the 0-4GB region of the 64-bit DMA space must
>>> be identical to the 0-4GB 32-bit DMA space, and in fact, the receiver
>>> of a transaction shouldn't be able to distinguish them.
>>>
>>> But it sounds like something's different on TILE-Gx? Does it
>>> translate bus addresses to physical memory addresses based on the type
>>> of the transaction (SAC vs DAC, or 32-bit vs 64-bit header) in
>>> addition to the address? Even if it does, the spec doesn't allow a
>>> DAC cycle or a 64-bit header where the 32 high-order bits are zero, so
>>> it shouldn't matter.
>> No, we don't translate based on the type of the transaction. Using
>> "DMA space" in the commit message was probably misleading. What's
>> really going on is different DMA windows. 32-bit DMA has the
>> obvious naive implementation where [0,4GB] in DMA space maps to
>> [0,4GB] in PA space. However, for 64-bit DMA, we use DMA
>> addresses with a non-zero high 32 bits, in the [1TB,2TB] range,
>> but map the results down to PA [0,1TB] using our IOMMU.
> I guess this means devices can DMA to physical addresses [0,3GB]
> using either 32-bit bus addresses in the [0,3GB] range or 64-bit bus
> addresses in the [1TB,1TB+3GB] range, right?

True in general, but not true for any specific individual device.

64-bit capable devices won’t generate 32-bit bus addresses, because
the dma_ops makes sure that only bus/DMA addresses in [1TB,1TB+3GB]
are handed out to the devices.

32-bit-only devices use bus addresses in [0,3GB] to access PA [0,3GB].
PA in [3GB, 4GB] is not accessed by 32-bit-only devices because the
bounce buffers are allocated under the 3GB limit.
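
(For concreteness, here is a minimal sketch of how a streaming map
operation could implement the two windows just described; the helper and
its shape are hypothetical, and only TILE_PCI_MEM_MAP_BASE_OFFSET and the
window layout come from this thread.)

#include <linux/dma-mapping.h>
#include <linux/swiotlb.h>

/*
 * Sketch only, not the actual arch/tile code. 64-bit capable devices
 * get bus addresses in [1TB,2TB]; 32-bit-only devices get an identity
 * mapping of [0,3GB], with swiotlb bouncing anything above it.
 */
static dma_addr_t tile_map_page_sketch(struct device *dev,
				       struct page *page,
				       unsigned long offset, size_t size,
				       enum dma_data_direction dir)
{
	phys_addr_t pa = page_to_phys(page) + offset;

	if (dma_get_mask(dev) > DMA_BIT_MASK(32))
		/* 64-bit window: PA plus the 1TB offset. */
		return pa + TILE_PCI_MEM_MAP_BASE_OFFSET;

	/* 32-bit window: identity map, bouncing pages above 3GB. */
	return swiotlb_map_page(dev, page, offset, size, dir, NULL);
}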

>> We did consider having the 64-bit DMA window be [0,1TB] and map
>> directly to PA space, like the 32-bit window. But this design
>> suffers from the “PCI hole” problem. Basically, the BAR space is
>> usually under 4GB (it occupies the range [3GB, 4GB] on tilegx) and
>> the host bridge uses negative decoding in passing DMA traffic
>> upstream. That is, DMA traffic with target addresses in [3GB, 4GB]
>> is not passed to host memory. This means that the same amount
>> of physical memory as the BAR space cannot be used for DMA
>> purposes. And because it is not easy to avoid this region when
>> allocating DMA memory, the kernel is simply told not to use this
>> chunk of PA at all, so it is wasted.
> OK, so physical memory in the [3GB,4GB] range is unreachable via DMA
> as you describe. And even if DMA *could* reach it, the CPU couldn't
> see it because CPU accesses to that range would go to PCI for the
> memory-mapped BAR space, not to memory.

Right. Unreachability is only a problem if the DMA window overlaps
[3G, 4G], and since the 64-bit DMA window is [1TB,2TB], the whole PA
space can be reached by 64-bit capable devices.

> But I can't figure out why Tile needs to do something special. I
> think other arches handle the PCI hole for MMIO space the same way.
>
> I don't know if other arches alias the [0,3GB] physical address
> range in both 32-bit and 64-bit DMA space like you do, but if that's
> part of the problem, it seems like you could easily avoid the
> aliasing by making the 64-bit DMA space [1TB+4GB,2TB] instead of
> [1TB,2TB].

Perhaps, but since 64-bit capable devices can't actually see the
aliasing (they aren't offered the [0,4GB] address range), they
only see an un-aliased space.

>> For the LSI device, the way we manage it is to ensure that the
>> device’s streaming buffers and the consistent buffers come from
>> different pools, with the latter using the under-4GB bounce
>> buffers. Obviously, normal devices use the same buffer pool for
>> both streaming and consistent, either under 4GB or the whole PA.
> It seems like you could make your DMA space be the union of [0,3GB]
> and [1TB+4GB,2TB], then use pci_set_dma_mask(dev, DMA_BIT_MASK(64))
> and pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)) (I assume the
> driver already sets those masks correctly if it works on other
> arches).

Unfortunately, the Megaraid driver doesn’t even call
pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)).

More generally, the DMA space you propose isn't optimal, because
then the PA in [3GB, 4GB] can’t be reached by 64-bit capable devices.

>> Given all of that, does this change make sense? I can certainly
>> amend the commit description to include more commentary.
> Obviously, I'm missing something. I guess it really doesn't matter
> because this is all arch code and I don't need to understand it, but
> it does niggle at me somehow.

I will add the following comment to <asm/pci.h> in hopes of making it a
bit clearer:

/*
[...]
+ * This design lets us avoid the "PCI hole" problem where the host bridge
+ * won't pass DMA traffic with target addresses that happen to fall within the
+ * BAR space. This enables us to use all the physical memory for DMA, instead
+ * of wasting the same amount of physical memory as the BAR window size.
*/
#define TILE_PCI_MEM_MAP_BASE_OFFSET (1ULL << CHIP_PA_WIDTH())
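
(To make the arithmetic concrete: on a chip where CHIP_PA_WIDTH() is 40,
which the [1TB,2TB] window discussed above implies, the offset works out
to 1TB, so a 64-bit mapping is simply:)

	/*
	 * Assumption for illustration: CHIP_PA_WIDTH() == 40, so the
	 * offset is 1ULL << 40 == 1TB, and PA [0,1TB] maps to bus
	 * addresses [1TB,2TB].
	 */
	dma_addr_t bus = pa + TILE_PCI_MEM_MAP_BASE_OFFSET;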

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2013-08-12 20:42:24

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

On Mon, Aug 12, 2013 at 1:42 PM, Chris Metcalf <[email protected]> wrote:
> On 8/9/2013 6:42 PM, Bjorn Helgaas wrote:
>> On Thu, Aug 08, 2013 at 12:47:10PM -0400, Chris Metcalf wrote:
>>> On 8/6/2013 1:48 PM, Bjorn Helgaas wrote:
>>>> [+cc Myron, Adam]
>>>>
>>>> On Fri, Aug 2, 2013 at 10:24 AM, Chris Metcalf <[email protected]> wrote:
>>>>> According to LSI,
>>>>> the firmware is not fully functional yet. This change implements a
>>>>> kind of hybrid dma_ops to support this.
>>>>>
>>>>> Note that on most other platforms, the 64-bit DMA addressing space is the
>>>>> same as the 32-bit DMA space and they overlap the physical memory space.
>>>>> No special arrangement is needed to support this kind of mixed DMA
>>>>> capability. On TILE-Gx, the 64-bit DMA space is completely separate
>>>>> from the 32-bit DMA space.
>>>> Help me understand what's going on here. My understanding is that on
>>>> typical systems, the 32-bit DMA space is a subset of the 64-bit DMA
>>>> space. In conventional PCI, "a master that supports 64-bit addressing
>>>> must generate a SAC, instead of a DAC, when the upper 32 bits of the
>>>> address are zero" (PCI spec r3.0, sec 3.9). PCIe doesn't have
>>>> SAC/DAC, but it has both 32-bit and 64-bit address headers and has a
>>>> similar requirement: "For Addresses below 4GB, Requesters must use the
>>>> 32-bit format" (PCIe spec r3.0, sec 2.2.4.1).
>>>>
>>>> Those imply to me that the 0-4GB region of the 64-bit DMA space must
>>>> be identical to the 0-4GB 32-bit DMA space, and in fact, the receiver
>>>> of a transaction shouldn't be able to distinguish them.
>>>>
>>>> But it sounds like something's different on TILE-Gx? Does it
>>>> translate bus addresses to physical memory addresses based on the type
>>>> of the transaction (SAC vs DAC, or 32-bit vs 64-bit header) in
>>>> addition to the address? Even if it does, the spec doesn't allow a
>>>> DAC cycle or a 64-bit header where the 32 high-order bits are zero, so
>>>> it shouldn't matter.
>>> No, we don't translate based on the type of the transaction. Using
>>> "DMA space" in the commit message was probably misleading. What's
>>> really going on is different DMA windows. 32-bit DMA has the
>>> obvious naive implementation where [0,4GB] in DMA space maps to
>>> [0,4GB] in PA space. However, for 64-bit DMA, we use DMA
>>> addresses with a non-zero high 32 bits, in the [1TB,2TB] range,
>>> but map the results down to PA [0,1TB] using our IOMMU.
>> I guess this means devices can DMA to physical addresses [0,3GB]
>> using either 32-bit bus addresses in the [0,3GB] range or 64-bit bus
>> addresses in the [1TB,1TB+3GB] range, right?
>
> True in general, but not true for any specific individual device.
>
> 64-bit capable devices won’t generate 32-bit bus addresses, because the dma_ops makes sure that only bus/DMA addresses in [1TB,1TB+3GB] are handed out to the devices.
>
> 32-bit-only devices use bus addresses in [0,3GB] to access PA [0,3GB]. PA in [3GB, 4GB] is not accessed by 32-bit-only devices because the bounce buffers are allocated under the 3GB limit.
>
>>> We did consider having the 64-bit DMA window be [0,1TB] and map
>>> directly to PA space, like the 32-bit window. But this design
> >>> suffers from the “PCI hole” problem. Basically, the BAR space is
> >>> usually under 4GB (it occupies the range [3GB, 4GB] on tilegx) and
> >>> the host bridge uses negative decoding in passing DMA traffic
> >>> upstream. That is, DMA traffic with target addresses in [3GB, 4GB]
> >>> is not passed to host memory. This means that the same amount
> >>> of physical memory as the BAR space cannot be used for DMA
> >>> purposes. And because it is not easy to avoid this region when
> >>> allocating DMA memory, the kernel is simply told not to use this
> >>> chunk of PA at all, so it is wasted.
>> OK, so physical memory in the [3GB,4GB] range is unreachable via DMA
>> as you describe. And even if DMA *could* reach it, the CPU couldn't
>> see it because CPU accesses to that range would go to PCI for the
>> memory-mapped BAR space, not to memory.
>
> Right. Unreachability is only a problem if the DMA window overlaps [3G, 4G], and since the 64-bit DMA window is [1TB,2TB], the whole PA space can be reached by 64-bit capable devices.

So the [0-1TB] memory range (including [3GB-4GB]) is reachable by
64-bit DMA to bus addresses [1TB-2TB]. But if the CPU can't see
physical memory from [3GB-4GB], how is it useful to DMA there?

>> But I can't figure out why Tile needs to do something special. I
>> think other arches handle the PCI hole for MMIO space the same way.
>>
>> I don't know if other arches alias the [0,3GB] physical address
>> range in both 32-bit and 64-bit DMA space like you do, but if that's
>> part of the problem, it seems like you could easily avoid the
>> aliasing by making the 64-bit DMA space [1TB+4GB,2TB] instead of
>> [1TB,2TB].
>
> Perhaps, but since 64-bit capable devices can't actually see the aliasing (they aren't offered the [0,4GB] address range), they only see an un-aliased space.
>
>>> For the LSI device, the way we manage it is to ensure that the
> >>> device’s streaming buffers and the consistent buffers come from
>>> different pools, with the latter using the under-4GB bounce
>>> buffers. Obviously, normal devices use the same buffer pool for
>>> both streaming and consistent, either under 4GB or the whole PA.
>> It seems like you could make your DMA space be the union of [0,3GB]
>> and [1TB+4GB,2TB], then use pci_set_dma_mask(dev, DMA_BIT_MASK(64))
>> and pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)) (I assume the
>> driver already sets those masks correctly if it works on other
>> arches).
>
> Unfortunately, the Megaraid driver doesn’t even call pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)).

If the Megaraid driver needs that call, but it's missing, why wouldn't
we just add it?

> More generally, the DMA space you propose isn't optimal, because then the PA in [3GB, 4GB] can’t be reached by 64-bit capable devices.

True. I assumed it wasn't useful to DMA there because the CPU
couldn't see that memory anyway. But apparently that assumption was
wrong?

>>> Given all of that, does this change make sense? I can certainly
>>> amend the commit description to include more commentary.
>> Obviously, I'm missing something. I guess it really doesn't matter
>> because this is all arch code and I don't need to understand it, but
>> it does niggle at me somehow.
>
> I will add the following comment to <asm/pci.h> in hopes of making it a bit clearer:
>
> /*
> [...]
> + * This design lets us avoid the "PCI hole" problem where the host bridge
> + * won't pass DMA traffic with target addresses that happen to fall within the
> + * BAR space. This enables us to use all the physical memory for DMA, instead
> + * of wasting the same amount of physical memory as the BAR window size.

By "target addresses", I guess you mean the bus address, not the CPU
address, right?

The whole reason I'm interested in this is to figure out whether this
change is really specific to Tile, or whether other architectures need
similar changes. I think host bridges on other arches behave the same
way (they don't allow DMA to addresses in the PCI hole), so I still
haven't figured out what is truly Tile-specific.

I guess the ability for 64-bit DMA to reach the PCI hole (3GB-4GB)
might be unique, but it doesn't sound useful.

> */
> #define TILE_PCI_MEM_MAP_BASE_OFFSET (1ULL << CHIP_PA_WIDTH())
>
> --
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
>

2013-08-13 16:12:53

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

(Trimming the quoted material a little to try to keep this email under control.)

On 8/12/2013 4:42 PM, Bjorn Helgaas wrote:
> On Mon, Aug 12, 2013 at 1:42 PM, Chris Metcalf <[email protected]> wrote:
>> On 8/9/2013 6:42 PM, Bjorn Helgaas wrote:
>>> OK, so physical memory in the [3GB,4GB] range is unreachable via DMA
>>> as you describe. And even if DMA *could* reach it, the CPU couldn't
>>> see it because CPU accesses to that range would go to PCI for the
>>> memory-mapped BAR space, not to memory.
>> Right. Unreachability is only a problem if the DMA window overlaps [3G, 4G], and since the 64-bit DMA window is [1TB,2TB], the whole PA space can be reached by 64-bit capable devices.
> So the [0-1TB] memory range (including [3GB-4GB]) is reachable by
> 64-bit DMA to bus addresses [1TB-2TB]. But if the CPU can't see
> physical memory from [3GB-4GB], how is it useful to DMA there?

Sorry, looking back I can see that the thread is a little confusing.
The CPU can see the whole PA space. The confusion comes from the BAR space
in [3GB, 4GB].

On Tile, we define the CPU memory space as follows:

[0, 1TB]: PA
[1TB + 3GB, 1TB + 4GB]: BAR space for RC port 0, in [3GB, 4GB]
[1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB]: BAR space for RC port N, in [3GB, 4GB]

The mapping from [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB] to [3GB, 4GB] is done by a
hardware PIO region, which generates PCI bus addresses in [3GB, 4GB] for MMIOs to
the BAR space.
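
(As a sketch of that window arithmetic, with made-up macro names; only
the numbers come from the layout above:)

	/*
	 * Hypothetical macros illustrating the CPU-side window for RC
	 * port N: a 1GB window at [1TB + 3GB + N*4GB, 1TB + (N+1)*4GB]
	 * that the PIO hardware translates to bus [3GB, 4GB].
	 */
	#define RC_BAR_WINDOW_BASE(n) \
		((1ULL << 40) + (3ULL << 30) + (u64)(n) * (4ULL << 30))
	#define RC_BAR_WINDOW_SIZE	(1ULL << 30)	/* 1GB */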

>> Unfortunately, the Megaraid driver doesn’t even call pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)).
> If the Megaraid driver needs that call, but it's missing, why wouldn't
> we just add it?

The Megaraid driver doesn’t strictly need that call on other platforms, because
by default the device coherent_dma_mask is DMA_BIT_MASK(32) and the consistent
memory pool doesn’t come from the bounce buffers on most other platforms.

Of course, for the sake of correctness, this call should be added across all platforms.

> >> More generally, the DMA space you propose isn't optimal, because then the PA in [3GB, 4GB] can’t be reached by 64-bit capable devices.
> True. I assumed it wasn't useful to DMA there because the CPU
> couldn't see that memory anyway. But apparently that assumption was
> wrong?

Correct.

>>>> Given all of that, does this change make sense? I can certainly
>>>> amend the commit description to include more commentary.
>>> Obviously, I'm missing something. I guess it really doesn't matter
>>> because this is all arch code and I don't need to understand it, but
>>> it does niggle at me somehow.
>> I will add the following comment to <asm/pci.h> in hopes of making it a bit clearer:
>>
>> /*
>> [...]
>> + * This design lets us avoid the "PCI hole" problem where the host bridge
>> + * won't pass DMA traffic with target addresses that happen to fall within the
>> + * BAR space. This enables us to use all the physical memory for DMA, instead
>> + * of wasting the same amount of physical memory as the BAR window size.
> By "target addresses", I guess you mean the bus address, not the CPU
> address, right?

Correct.

> The whole reason I'm interested in this is to figure out whether this
> change is really specific to Tile, or whether other architectures need
> similar changes. I think host bridges on other arches behave the same
> way (they don't allow DMA to addresses in the PCI hole), so I still
> haven't figured out what is truly Tile-specific.

What is unique about Tile is that the PCI drivers must explicitly declare
their DMA capability by calling pci_set_dma_mask() and pci_set_consistent_dma_mask().

This is why we must patch those drivers that don’t call pci_set_consistent_dma_mask(),
as is the case with the Megaraid driver.
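
(For concreteness, the kind of driver-side patch meant here would look
roughly like the following; the probe function is hypothetical, but the
two mask calls are the standard PCI DMA API.)

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/*
 * Sketch of the driver fix discussed above: declare both the streaming
 * and the consistent DMA capability explicitly, so the tile hooks can
 * install the matching dma_ops.
 */
static int example_hba_probe(struct pci_dev *pdev)
{
	int err;

	err = pci_set_dma_mask(pdev, DMA_BIT_MASK(64));
	if (err)
		return err;

	/* This HBA's consistent memory must stay below 4GB. */
	return pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(32));
}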

> I guess the ability for 64-bit DMA to reach the PCI hole (3GB-4GB)
> might be unique, but it doesn't sound useful.

It seems like other architectures might benefit from the approach we've taken
with tile, but it's certainly disruptive enough that it might not be worth it.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2013-08-13 20:30:54

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

[+cc James in case he has opinions on the DMA mask question]

On Tue, Aug 13, 2013 at 10:12 AM, Chris Metcalf <[email protected]> wrote:
> (Trimming the quoted material a little to try to keep this email under control.)
>
> On 8/12/2013 4:42 PM, Bjorn Helgaas wrote:
>> On Mon, Aug 12, 2013 at 1:42 PM, Chris Metcalf <[email protected]> wrote:
>>> On 8/9/2013 6:42 PM, Bjorn Helgaas wrote:
>>>> OK, so physical memory in the [3GB,4GB] range is unreachable via DMA
>>>> as you describe. And even if DMA *could* reach it, the CPU couldn't
>>>> see it because CPU accesses to that range would go to PCI for the
>>>> memory-mapped BAR space, not to memory.
>>> Right. Unreachability is only a problem if the DMA window overlaps [3G, 4G], and since the 64-bit DMA window is [1TB,2TB], the whole PA space can be reached by 64-bit capable devices.
>> So the [0-1TB] memory range (including [3GB-4GB]) is reachable by
>> 64-bit DMA to bus addresses [1TB-2TB]. But if the CPU can't see
>> physical memory from [3GB-4GB], how is it useful to DMA there?
>
> Sorry, looking back I can see that the thread is a little confusing.
> The CPU can see the whole PA space. The confusion comes from the BAR space
> in [3GB, 4GB].
>
> On Tile, we define the CPU memory space as follows:
>
> [0, 1TB]: PA
> [1TB + 3GB, 1TB + 4GB]: BAR space for RC port 0, in [3GB, 4GB]
> [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB]: BAR space for RC port N, in [3GB, 4GB]
>
> The mapping from [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB] to [3GB, 4GB] is done by a
> hardware PIO region, which generates PCI bus addresses in [3GB, 4GB] for MMIOs to
> the BAR space.

OK, I think I get it now. CPU address space:
[0, 1TB]: physical memory
[1TB + 3GB, 1TB + 4GB]: translated to bus address [3GB, 4GB] under RC port 0
[1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB]: translated to bus address
[3GB, 4GB] under RC port N

Bus address space:
[0, 3GB]: 32-bit DMA reaches physical memory [0, 3GB]
[3GB, 4GB]: 32-bit DMA (peer-to-peer DMA under local RC port, I guess?)
[1TB, 2TB]: 64-bit DMA mapped via IOMMU to physical memory [0, 1TB]

I guess the problem is that 32-bit DMA can't reach physical memory
[3GB, 4GB], so you're using bounce buffers so the bus address is in
[0, 3GB]. That makes sense, and I don't see another possibility other
than just throwing away the [3GB, 4GB] range by leaving it out of the
kernel allocator altogether, or using hardware (which tilegx probably
doesn't have) to remap it somewhere else.

So it seems like just a question of how you wrap this all up in
dma_ops, and *that* is all arch stuff that I don't have an opinion on.

>>> Unfortunately, the Megaraid driver doesn’t even call pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)).
>> If the Megaraid driver needs that call, but it's missing, why wouldn't
>> we just add it?
>
> The Megaraid driver doesn’t strictly need that call on other platforms, because
> by default the device coherent_dma_mask is DMA_BIT_MASK(32) and the consistent
> memory pool doesn’t come from the bounce buffers on most other platforms.
>
> Of course, for the sake of correctness, this call should be added across all platforms.
> ...

> What is unique about Tile is that the PCI drivers must explicitly declare
> their DMA capability by calling pci_set_dma_mask() and pci_set_consistent_dma_mask().

It looks like the reason you need drivers to explicitly call
pci_set_dma_mask() and pci_set_consistent_dma_mask() is because you
have hooks in those functions to tweak the dma_ops, even though the
mask itself might not be changed.

That doesn't sound like a robust solution: we have well-known defaults
for the mask sizes, and I don't think it's reasonable to expect
drivers to explicitly set the mask even if they are happy with the
defaults (though Documentation/DMA-API-HOWTO.txt does say that being
explicit is good style). I'm afraid you'll just keep tripping over
drivers that don't work on tilegx because they don't set the mask.

Bjorn

2013-08-13 21:53:17

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

On Tue, 2013-08-13 at 14:30 -0600, Bjorn Helgaas wrote:
> [+cc James in case he has opinions on the DMA mask question]
>
> On Tue, Aug 13, 2013 at 10:12 AM, Chris Metcalf <[email protected]> wrote:
> > (Trimming the quoted material a little to try to keep this email under control.)
> >
> > On 8/12/2013 4:42 PM, Bjorn Helgaas wrote:
> >> On Mon, Aug 12, 2013 at 1:42 PM, Chris Metcalf <[email protected]> wrote:
> >>> On 8/9/2013 6:42 PM, Bjorn Helgaas wrote:
> >>>> OK, so physical memory in the [3GB,4GB] range is unreachable via DMA
> >>>> as you describe. And even if DMA *could* reach it, the CPU couldn't
> >>>> see it because CPU accesses to that range would go to PCI for the
> >>>> memory-mapped BAR space, not to memory.
> >>> Right. Unreachability is only a problem if the DMA window overlaps [3G, 4G], and since the 64-bit DMA window is [1TB,2TB], the whole PA space can be reached by 64-bit capable devices.
> >> So the [0-1TB] memory range (including [3GB-4GB]) is reachable by
> >> 64-bit DMA to bus addresses [1TB-2TB]. But if the CPU can't see
> >> physical memory from [3GB-4GB], how is it useful to DMA there?
> >
> > Sorry, looking back I can see that the thread is a little confusing.
> > The CPU can see the whole PA space. The confusion comes from the BAR space
> > in [3GB, 4GB].
> >
> > On Tile, we define the CPU memory space as follows:
> >
> > [0, 1TB]: PA
> > [1TB + 3GB, 1TB + 4GB]: BAR space for RC port 0, in [3GB, 4GB]
> > [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB]: BAR space for RC port N, in [3GB, 4GB]
> >
> > The mapping from [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB] to [3GB, 4GB] is done by a
> > hardware PIO region, which generates PCI bus addresses in [3GB, 4GB] for MMIOs to
> > the BAR space.
>
> OK, I think I get it now. CPU address space:
> [0, 1TB]: physical memory
> [1TB + 3GB, 1TB + 4GB]: translated to bus address [3GB, 4GB] under RC port 0
> [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB]: translated to bus address
> [3GB, 4GB] under RC port N
>
> Bus address space:
> [0, 3GB]: 32-bit DMA reaches physical memory [0, 3GB]
> [3GB, 4GB]: 32-bit DMA (peer-to-peer DMA under local RC port, I guess?)
> [1TB, 2TB]: 64-bit DMA mapped via IOMMU to physical memory [0, 1TB]
>
> I guess the problem is that 32-bit DMA can't reach physical memory
> [3GB, 4GB], so you're using bounce buffers so the bus address is in
> [0, 3GB]. That makes sense, and I don't see another possibility other
> than just throwing away the [3GB, 4GB] range by leaving it out of the
> kernel allocator altogether, or using hardware (which tilegx probably
> doesn't have) to remap it somewhere else.

This is remarkably familiar. I think almost every system on earth has a
configuration similar to this. On PARISC, the top 256MB of memory on a
32 bit system is reserved for I/O access and is designated as "F Space".

What is unusual is that you seem to have responding memory behind the F
Space which is accessible to some bus entities. On PARISC 32 bit, the
memory is just lost (inaccessible); on 64 bit, it's remapped above 32GB
(and the low F Space window expanded to 1GB).

> So it seems like just a question of how you wrap this all up in
> dma_ops, and *that* is all arch stuff that I don't have an opinion on.
>
> >>> Unfortunately, the Megaraid driver doesn’t even call pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)).
> >> If the Megaraid driver needs that call, but it's missing, why wouldn't
> >> we just add it?
> >
> > The Megaraid driver doesn’t strictly need that call on other platforms, because
> > by default the device coherent_dma_mask is DMA_BIT_MASK(32) and the consistent
> > memory pool doesn’t come from the bounce buffers on most other platforms.
> >
> > Of course, for the sake of correctness, this call should be added across all platforms.
> > ...
>
> > What is unique about Tile is that the PCI drivers must explicitly declare
> > its DMA capability by calling pci_set_dma_mask() and pci_set_consistent_dma_mask().
>
> It looks like the reason you need drivers to explicitly call
> pci_set_dma_mask() and pci_set_consistent_dma_mask() is because you
> have hooks in those functions to tweak the dma_ops, even though the
> mask itself might not be changed.
>
> That doesn't sound like a robust solution: we have well-known defaults
> for the mask sizes, and I don't think it's reasonable to expect
> drivers to explicitly set the mask even if they are happy with the
> defaults (though Documentation/DMA-API-HOWTO.txt does say that being
> explicit is good style). I'm afraid you'll just keep tripping over
> drivers that don't work on tilegx because they don't set the mask.

Right, it's not a robust solution at all. A DMA mask is just that: an
accessibility mask. The founding assumption is that an address line is
either connected or not, which is why the mask works. What you have is
two different classes of memory: 0-3GB which is usable for I/O and 3-4GB
which isn't. Surely what you need to do is implement ZONE_DMA32? which
stretches from 0-3GB, which means all kmallocs in the driver will be in
the right range. Then you need a bounce pfn at 3GB which means that
user space which gets the 3-4GB is bounced. There's a magic pfn in
BLK_BOUNCE_HIGH that was designed for this, but I'm not sure the design
contemplated BLK_BOUNCE_HIGH being different for 64 and 32 bit.
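
(In block-layer terms the bounce limit is set per queue, so the sketch
would be something like the line below; whether BLK_BOUNCE_HIGH could be
taught to express a 3GB limit on 64 bit is exactly the open question.)

	/*
	 * Sketch, in a hypothetical block driver: bounce any page above
	 * the 3GB boundary before I/O; "q" is the driver's request queue.
	 */
	blk_queue_bounce_limit(q, (3ULL << 30) - 1);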

James


2013-08-14 19:10:29

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH v2] tile: support LSI MEGARAID SAS HBA hybrid dma_ops

On 8/13/2013 4:30 PM, Bjorn Helgaas wrote:
> [+cc James in case he has opinions on the DMA mask question]
>
> On Tue, Aug 13, 2013 at 10:12 AM, Chris Metcalf <[email protected]> wrote:
>> (Trimming the quoted material a little to try to keep this email under control.)
>>
>> On 8/12/2013 4:42 PM, Bjorn Helgaas wrote:
>>> On Mon, Aug 12, 2013 at 1:42 PM, Chris Metcalf <[email protected]> wrote:
>>>> On 8/9/2013 6:42 PM, Bjorn Helgaas wrote:
>>>>> OK, so physical memory in the [3GB,4GB] range is unreachable via DMA
>>>>> as you describe. And even if DMA *could* reach it, the CPU couldn't
>>>>> see it because CPU accesses to that range would go to PCI for the
>>>>> memory-mapped BAR space, not to memory.
>>>> Right. Unreachability is only a problem if the DMA window overlaps [3G, 4G], and since the 64-bit DMA window is [1TB,2TB], the whole PA space can be reached by 64-bit capable devices.
>>> So the [0-1TB] memory range (including [3GB-4GB]) is reachable by
>>> 64-bit DMA to bus addresses [1TB-2TB]. But if the CPU can't see
>>> physical memory from [3GB-4GB], how is it useful to DMA there?
>> Sorry, looking back I can see that the thread is a little confusing.
>> The CPU can see the whole PA space. The confusion comes from the BAR space
>> in [3GB, 4GB].
>>
>> On Tile, we define the CPU memory space as follows:
>>
>> [0, 1TB]: PA
>> [1TB + 3GB, 1TB + 4GB]: BAR space for RC port 0, in [3GB, 4GB]
>> [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB]: BAR space for RC port N, in [3GB, 4GB]
>>
>> The mapping from [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB] to [3GB, 4GB] is done by a
>> hardware PIO region, which generates PCI bus addresses in [3GB, 4GB] for MMIOs to
>> the BAR space.
> OK, I think I get it now. CPU address space:
> [0, 1TB]: physical memory
> [1TB + 3GB, 1TB + 4GB]: translated to bus address [3GB, 4GB] under RC port 0
> [1TB + 3GB + N*4GB, 1TB + (1 + N)*4GB]: translated to bus address
> [3GB, 4GB] under RC port N
>
> Bus address space:
> [0, 3GB]: 32-bit DMA reaches physical memory [0, 3GB]
> [3GB, 4GB]: 32-bit DMA (peer-to-peer DMA under local RC port, I guess?)

Correct.

> [1TB, 2TB]: 64-bit DMA mapped via IOMMU to physical memory [0, 1TB]
>
> I guess the problem is that 32-bit DMA can't reach physical memory
> [3GB, 4GB], so you're using bounce buffers so the bus address is in
> [0, 3GB]. That makes sense, and I don't see another possibility other
> than just throwing away the [3GB, 4GB] range by leaving it out of the
> kernel allocator altogether, or using hardware (which tilegx probably
> doesn't have) to remap it somewhere else.
>
> So it seems like just a question of how you wrap this all up in
> dma_ops, and *that* is all arch stuff that I don't have an opinion on.
>
>>>> Unfortunately, the Megaraid driver doesn’t even call pci_set_consistent_dma_mask(dev, DMA_BIT_MASK(32)).
>>> If the Megaraid driver needs that call, but it's missing, why wouldn't
>>> we just add it?
>> The Megaraid driver doesn’t strictly need that call on other platforms, because
>> by default the device coherent_dma_mask is DMA_BIT_MASK(32) and the consistent
>> memory pool doesn’t come from the bounce buffers on most other platforms.
>>
>> Of course, for the sake of correctness, this call should be added across all platforms.
>> ...
>> What is unique about Tile is that the PCI drivers must explicitly declare
>> their DMA capability by calling pci_set_dma_mask() and pci_set_consistent_dma_mask().
> It looks like the reason you need drivers to explicitly call
> pci_set_dma_mask() and pci_set_consistent_dma_mask() is because you
> have hooks in those functions to tweak the dma_ops, even though the
> mask itself might not be changed.
>
> That doesn't sound like a robust solution: we have well-known defaults
> for the mask sizes, and I don't think it's reasonable to expect
> drivers to explicitly set the mask even if they are happy with the
> defaults (though Documentation/DMA-API-HOWTO.txt does say that being
> explicit is good style). I'm afraid you'll just keep tripping over
> drivers that don't work on tilegx because they don't set the mask.

Yes, I think you (and James) are right. We will look at changing to a DMA_BIT_MASK(32) default, do some testing, and see what that ends up looking like.

James also wrote:
> What you have is
> two different classes of memory: 0-3GB which is usable for I/O and 3-4GB
> which isn't. Surely what you need to do is implement ZONE_DMA32? which
> stretches from 0-3GB, which means all kmallocs in the driver will be in
> the right range. Then you need a bounce pfn at 3GB which means that
> user space which gets the 3-4GB is bounced. There's a magic pfn in
> BLK_BOUNCE_HIGH that was designed for this, but I'm not sure the design
> contemplated BLK_BOUNCE_HIGH being different for 64 and 32 bit.

Well, we do have ZONE_DMA already, for PAs 0-4GB. This is a sort of generally reasonable definition, and works well for 32-bit USB devices, for example. But you're right that it does mean that sometimes something like swiotlb_alloc_coherent() might try to allocate memory for 32-bit PCI and get something that falls in the 3-4GB hole, and it then throws it away and uses map_single(). So arguably we should just shift our ZONE_DMA ceiling down to 3GB to make PCI more likely to get a usable page, though that penalizes 32-bit USB, and if we ever thought about expanding the size of the PCI BAR there, it would penalize it more.
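
(If we did shift the ceiling, the change would be a small tweak at
zone-init time, something like the following sketch; free_area_init_nodes()
is the generic mechanism, but whether our setup.c is structured exactly
this way is an assumption.)

	/*
	 * Sketch only: cap ZONE_DMA at PA 3GB instead of 4GB so that
	 * DMA-zone allocations cannot land in the [3GB, 4GB] PCI hole.
	 * The function name is hypothetical.
	 */
	static void __init zone_sizes_init_sketch(void)
	{
		unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0 };

		max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
					      (3UL << 30) >> PAGE_SHIFT);
		max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
		free_area_init_nodes(max_zone_pfns);
	}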

Bottom line, though, it does seem like in general the 32-bit devices (both PCI and USB) that we see in practice are pretty low performance, so it's hard to worry too much about it, as long as they work.

I don't know that we've looked at the BLK_BOUNCE_xxx stuff much; I'm personally not very familiar with it, except that as you say, it seems to be hard-wired to -1ULL for 64-bit systems.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2013-08-30 14:15:17

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH] tile PCI RC: make default consistent DMA mask 32-bit

This change sets the PCI devices' initial DMA capabilities
conservatively and promotes them at the request of the driver,
as opposed to assuming advanced DMA capabilities. The old design
runs the risk of breaking drivers that assume default capabilities.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/device.h | 5 ++++-
arch/tile/include/asm/dma-mapping.h | 21 +++++++++++++--------
arch/tile/kernel/pci-dma.c | 21 ++++++++++++---------
arch/tile/kernel/pci_gx.c | 15 +++++++++++++--
4 files changed, 42 insertions(+), 20 deletions(-)

diff --git a/arch/tile/include/asm/device.h b/arch/tile/include/asm/device.h
index 5182705..6ab8bf14 100644
--- a/arch/tile/include/asm/device.h
+++ b/arch/tile/include/asm/device.h
@@ -23,7 +23,10 @@ struct dev_archdata {
/* Offset of the DMA address from the PA. */
dma_addr_t dma_offset;

- /* Highest DMA address that can be generated by this device. */
+ /*
+ * Highest DMA address that can be generated by devices that
+ * have limited DMA capability, i.e. non 64-bit capable.
+ */
dma_addr_t max_direct_dma_addr;
};

diff --git a/arch/tile/include/asm/dma-mapping.h b/arch/tile/include/asm/dma-mapping.h
index 6f522d5..1eae359 100644
--- a/arch/tile/include/asm/dma-mapping.h
+++ b/arch/tile/include/asm/dma-mapping.h
@@ -92,14 +92,19 @@ dma_set_mask(struct device *dev, u64 mask)
{
struct dma_map_ops *dma_ops = get_dma_ops(dev);

- /* Handle legacy PCI devices with limited memory addressability. */
- if ((dma_ops == gx_pci_dma_map_ops ||
- dma_ops == gx_hybrid_pci_dma_map_ops ||
- dma_ops == gx_legacy_pci_dma_map_ops) &&
- (mask <= DMA_BIT_MASK(32))) {
- set_dma_ops(dev, gx_legacy_pci_dma_map_ops);
- set_dma_offset(dev, 0);
- if (mask > dev->archdata.max_direct_dma_addr)
+ /*
+ * For PCI devices with 64-bit DMA addressing capability, promote
+ * the dma_ops to hybrid, with the consistent memory DMA space limited
+ * to 32-bit. For 32-bit capable devices, limit the streaming DMA
+ * address range to max_direct_dma_addr.
+ */
+ if (dma_ops == gx_pci_dma_map_ops ||
+ dma_ops == gx_hybrid_pci_dma_map_ops ||
+ dma_ops == gx_legacy_pci_dma_map_ops) {
+ if (mask == DMA_BIT_MASK(64) &&
+ dma_ops == gx_legacy_pci_dma_map_ops)
+ set_dma_ops(dev, gx_hybrid_pci_dma_map_ops);
+ else if (mask > dev->archdata.max_direct_dma_addr)
mask = dev->archdata.max_direct_dma_addr;
}

diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
index d94f487..09b5870 100644
--- a/arch/tile/kernel/pci-dma.c
+++ b/arch/tile/kernel/pci-dma.c
@@ -588,15 +588,18 @@ int dma_set_coherent_mask(struct device *dev, u64 mask)
{
struct dma_map_ops *dma_ops = get_dma_ops(dev);

- /* Handle hybrid PCI devices with limited memory addressability. */
- if ((dma_ops == gx_pci_dma_map_ops ||
- dma_ops == gx_hybrid_pci_dma_map_ops ||
- dma_ops == gx_legacy_pci_dma_map_ops) &&
- (mask <= DMA_BIT_MASK(32))) {
- if (dma_ops == gx_pci_dma_map_ops)
- set_dma_ops(dev, gx_hybrid_pci_dma_map_ops);
-
- if (mask > dev->archdata.max_direct_dma_addr)
+ /*
+ * For PCI devices with 64-bit DMA addressing capability, promote
+ * the dma_ops to full capability for both streams and consistent
+ * memory access. For 32-bit capable devices, limit the consistent
+ * memory DMA range to max_direct_dma_addr.
+ */
+ if (dma_ops == gx_pci_dma_map_ops ||
+ dma_ops == gx_hybrid_pci_dma_map_ops ||
+ dma_ops == gx_legacy_pci_dma_map_ops) {
+ if (mask == DMA_BIT_MASK(64))
+ set_dma_ops(dev, gx_pci_dma_map_ops);
+ else if (mask > dev->archdata.max_direct_dma_addr)
mask = dev->archdata.max_direct_dma_addr;
}

diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
index 66ef9db..29acac6 100644
--- a/arch/tile/kernel/pci_gx.c
+++ b/arch/tile/kernel/pci_gx.c
@@ -1081,13 +1081,24 @@ int pcibios_enable_device(struct pci_dev *dev, int mask)
return pci_enable_resources(dev, mask);
}

-/* Called for each device after PCI setup is done. */
+/*
+ * Called for each device after PCI setup is done.
+ * We initialize the PCI device capabilities conservatively, assuming that
+ * all devices can only address the 32-bit DMA space. The exception here is
+ * that the device dma_offset is set to the value that matches the 64-bit
+ * capable devices. This is OK because dma_offset is not used by legacy
+ * dma_ops, nor by the hybrid dma_ops's streaming DMAs, which are 64-bit ops.
+ * This implementation matches the kernel design of setting PCI devices'
+ * coherent_dma_mask to 0xffffffffull by default, allowing the device drivers
+ * to skip calling pci_set_consistent_dma_mask(DMA_BIT_MASK(32)).
+ */
static void pcibios_fixup_final(struct pci_dev *pdev)
{
- set_dma_ops(&pdev->dev, gx_pci_dma_map_ops);
+ set_dma_ops(&pdev->dev, gx_legacy_pci_dma_map_ops);
set_dma_offset(&pdev->dev, TILE_PCI_MEM_MAP_BASE_OFFSET);
pdev->dev.archdata.max_direct_dma_addr =
TILE_PCI_MAX_DIRECT_DMA_ADDRESS;
+ pdev->dev.coherent_dma_mask = TILE_PCI_MAX_DIRECT_DMA_ADDRESS;
}
DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, pcibios_fixup_final);

--
1.8.3.1
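
(Under this patch, a fully 64-bit capable driver promotes itself from the
conservative default in two steps; a usage sketch, with a hypothetical
probe function:)

#include <linux/pci.h>
#include <linux/dma-mapping.h>

/*
 * Usage sketch for the patch above. After pcibios_fixup_final(), a
 * device starts with gx_legacy_pci_dma_map_ops and a 32-bit coherent
 * mask; these calls promote a 64-bit capable device first to the hybrid
 * and then to the full 64-bit dma_ops.
 */
static int example_64bit_probe(struct pci_dev *pdev)
{
	int err;

	/* legacy -> hybrid: 64-bit streaming, 32-bit consistent. */
	err = dma_set_mask(&pdev->dev, DMA_BIT_MASK(64));
	if (err)
		return err;

	/* hybrid -> full: 64-bit consistent memory as well. */
	return dma_set_coherent_mask(&pdev->dev, DMA_BIT_MASK(64));
}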