2005-03-18 21:23:55

by Matt Domsch

Subject: [PATCH 2.4.30-pre3] x86_64: pci_alloc_consistent() match 2.6 implementation

For review and comment.

On x86_64 systems with no IOMMU and with >4GB RAM (in fact, whenever
there are any pages mapped above 4GB), pci_alloc_consistent() falls
back to using ZONE_DMA for all allocations, even if the device's
dma_mask could have supported using memory from other zones. Problems
can be seen when other ZONE_DMA users (SWIOTLB, scsi_malloc()) consume
all of ZONE_DMA, leaving none for pci_alloc_consistent() to use.
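
To make the failure mode concrete, here is a minimal user-space sketch
of the two checks in question; end_pfn, PAGE_SHIFT, and the mask values
are stand-ins rather than the kernel's own:

#include <stdint.h>

#define PAGE_SHIFT 12

/* Old nommu test: force ZONE_DMA whenever *any* page in the system
 * lies above the device's dma_mask, regardless of where this
 * particular allocation would actually have landed. */
static int old_needs_zone_dma(uint64_t end_pfn, uint64_t dma_mask)
{
        return end_pfn > (dma_mask >> PAGE_SHIFT);
}

/* 2.6-style test: look at the bus address actually returned, and
 * only fall back to GFP_DMA when it misses the device mask. */
static int addr_fits_mask(uint64_t bus_addr, uint64_t dma_mask)
{
        return (bus_addr & ~dma_mask) == 0;
}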

Patch below makes pci_alloc_consistent() for the nommu case (EM64T
processors) match the 2.6 implementation of dma_alloc_coherent(), with
the exception that this continues to use GFP_ATOMIC.

Signed-off-by: Matt Domsch <[email protected]>

Thanks,
Matt

--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

--- linux-2.4/arch/x86_64/kernel/pci-nommu.c   Fri Feb 25 13:01:44 2005
+++ linux-2.4/arch/x86_64/kernel/pci-nommu.c   Fri Feb 25 06:56:55 2005
@@ -13,18 +13,28 @@ void *pci_alloc_consistent(struct pci_de
                            dma_addr_t *dma_handle)
 {
         void *ret;
+        u64 mask;
+        int order = get_order(size);
         int gfp = GFP_ATOMIC;
-
-        if (hwdev == NULL ||
-            end_pfn > (hwdev->dma_mask>>PAGE_SHIFT) ||  /* XXX */
-            (u32)hwdev->dma_mask < 0xffffffff)
-                gfp |= GFP_DMA;
-        ret = (void *)__get_free_pages(gfp, get_order(size));
 
-        if (ret != NULL) {
-                memset(ret, 0, size);
+        if (hwdev)
+                mask = hwdev->dma_mask;
+        else
+                mask = 0xffffffffULL;
+
+        for (;;) {
+                ret = (void *)__get_free_pages(gfp, order);
+                if (ret == NULL)
+                        return NULL;
                 *dma_handle = virt_to_bus(ret);
+                if ((*dma_handle & ~mask) == 0)
+                        break;
+                free_pages((unsigned long)ret, order);
+                if (gfp & GFP_DMA)
+                        return NULL;
+                gfp |= GFP_DMA;
         }
+        memset(ret, 0, size);
         return ret;
 }
 
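
For illustration only, a hedged usage sketch from a driver's point of
view; my_dev_alloc and MY_BUF_SIZE are hypothetical names, not part of
this patch. With the change above, a device that advertises a 64-bit
dma_mask no longer drains ZONE_DMA for its consistent buffers:

#include <linux/pci.h>

#define MY_BUF_SIZE 4096        /* hypothetical buffer size */

/* Hypothetical helper: negotiate a DMA mask, then allocate. */
static void *my_dev_alloc(struct pci_dev *pdev, dma_addr_t *handle)
{
        /* Prefer a 64-bit mask; fall back to 32-bit if refused. */
        if (pci_set_dma_mask(pdev, 0xffffffffffffffffULL) &&
            pci_set_dma_mask(pdev, 0xffffffffULL))
                return NULL;    /* no usable DMA addressing */
        return pci_alloc_consistent(pdev, MY_BUF_SIZE, handle);
}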


2005-03-19 06:10:13

by Arjan van de Ven

Subject: Re: [PATCH 2.4.30-pre3] x86_64: pci_alloc_consistent() match 2.6 implementation

On Fri, 2005-03-18 at 15:23 -0600, Matt Domsch wrote:
> For review and comment.
>
> On x86_64 systems with no IOMMU and with >4GB RAM (in fact, whenever
> there are any pages mapped above 4GB), pci_alloc_consistent() falls
> back to using ZONE_DMA for all allocations, even if the device's
> dma_mask could have supported using memory from other zones. Problems
> can be seen when other ZONE_DMA users (SWIOTLB, scsi_malloc()) consume
> all of ZONE_DMA, leaving none for pci_alloc_consistent() to use.

scsi_malloc no longer uses ZONE_DMA nowadays....


2005-03-19 14:16:41

by Matt Domsch

Subject: Re: [PATCH 2.4.30-pre3] x86_64: pci_alloc_consistent() match 2.6 implementation

On Sat, Mar 19, 2005 at 07:09:45AM +0100, Arjan van de Ven wrote:
> On Fri, 2005-03-18 at 15:23 -0600, Matt Domsch wrote:
> > For review and comment.
> >
> > On x86_64 systems with no IOMMU and with >4GB RAM (in fact, whenever
> > there are any pages mapped above 4GB), pci_alloc_consistent() falls
> > back to using ZONE_DMA for all allocations, even if the device's
> > dma_mask could have supported using memory from other zones. Problems
> > can be seen when other ZONE_DMA users (SWIOTLB, scsi_malloc()) consume
> > all of ZONE_DMA, leaving none for pci_alloc_consistent() to use.
>
> scsi_malloc no longer uses ZONE_DMA nowadays....

In 2.4.x it does. scsi_resize_dma_pool() has:
        __get_free_pages(GFP_ATOMIC | GFP_DMA, 0);
and scsi_init_minimal_dma_pool() does the same.


--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2005-03-19 16:29:21

by Arjan van de Ven

Subject: Re: [PATCH 2.4.30-pre3] x86_64: pci_alloc_consistent() match 2.6 implementation

On Sat, 2005-03-19 at 08:16 -0600, Matt Domsch wrote:
> On Sat, Mar 19, 2005 at 07:09:45AM +0100, Arjan van de Ven wrote:
> > On Fri, 2005-03-18 at 15:23 -0600, Matt Domsch wrote:
> > > For review and comment.
> > >
> > > On x86_64 systems with no IOMMU and with >4GB RAM (in fact, whenever
> > > there are any pages mapped above 4GB), pci_alloc_consistent() falls
> > > back to using ZONE_DMA for all allocations, even if the device's
> > > dma_mask could have supported using memory from other zones. Problems
> > > can be seen when other ZONE_DMA users (SWIOTLB, scsi_malloc()) consume
> > > all of ZONE_DMA, leaving none for pci_alloc_consistent() to use.
> >
> > scsi_malloc no longer uses ZONE_DMA nowadays....
>
> In 2.4.x it does. scsi_resize_dma_pool() has:
>         __get_free_pages(GFP_ATOMIC | GFP_DMA, 0);
> and scsi_init_minimal_dma_pool() does the same.
>
>

oh, you want to make major changes to the 2.4 tree... changing such VM
behavior sounds like a bad idea.


2005-03-19 19:26:54

by Andi Kleen

Subject: Re: [PATCH 2.4.30-pre3] x86_64: pci_alloc_consistent() match 2.6 implementation

On Fri, Mar 18, 2005 at 03:23:44PM -0600, Matt Domsch wrote:
> For review and comment.
>
> On x86_64 systems with no IOMMU and with >4GB RAM (in fact, whenever
> there are any pages mapped above 4GB), pci_alloc_consistent() falls
> back to using ZONE_DMA for all allocations, even if the device's
> dma_mask could have supported using memory from other zones. Problems
> can be seen when other ZONE_DMA users (SWIOTLB, scsi_malloc()) consume
> all of ZONE_DMA, leaving none for pci_alloc_consistent() to use.
>
> Patch below makes pci_alloc_consistent() for the nommu case (EM64T
> processors) match the 2.6 implementation of dma_alloc_coherent(), with
> the exception that this continues to use GFP_ATOMIC.

You fixed the wrong code. The pci-nommu code is only used
when IOMMU is disabled in the Kconfig. But most kernels have
it enabled. You would need to change it in pci-gart.c too.

The reason it is like this is that nommu was always intended as a hackish
kludge that would be used only for debugging - little did we know that
it would become standard later.

-Andi

2005-03-19 22:18:32

by Matt Domsch

Subject: Re: [PATCH 2.4.30-pre3] x86_64: pci_alloc_consistent() match 2.6 implementation

On Sat, Mar 19, 2005 at 08:26:45PM +0100, Andi Kleen wrote:
> On Fri, Mar 18, 2005 at 03:23:44PM -0600, Matt Domsch wrote:
> > For review and comment.
> >
> > On x86_64 systems with no IOMMU and with >4GB RAM (in fact, whenever
> > there are any pages mapped above 4GB), pci_alloc_consistent() falls
> > back to using ZONE_DMA for all allocations, even if the device's
> > dma_mask could have supported using memory from other zones. Problems
> > can be seen when other ZONE_DMA users (SWIOTLB, scsi_malloc()) consume
> > all of ZONE_DMA, leaving none for pci_alloc_consistent() to use.
> >
> > Patch below makes pci_alloc_consistent() for the nommu case (EM64T
> > processors) match the 2.6 implementation of dma_alloc_coherent(), with
> > the exception that this continues to use GFP_ATOMIC.
>
> You fixed the wrong code. The pci-nommu code is only used
> when IOMMU is disabled in the Kconfig. But most kernels have
> it enabled. You would need to change it in pci-gart.c too.

OK, then how's this for review? Compiles clean, can't test it myself
for a few days.

--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

===== arch/x86_64/kernel/pci-gart.c 1.12 vs edited =====
--- 1.12/arch/x86_64/kernel/pci-gart.c  2004-06-03 05:29:36 -05:00
+++ edited/arch/x86_64/kernel/pci-gart.c 2005-03-19 15:56:34 -06:00
@@ -154,27 +154,37 @@ void *pci_alloc_consistent(struct pci_de
         int gfp = GFP_ATOMIC;
         int i;
         unsigned long iommu_page;
+        dma_addr_t dma_mask;
 
-        if (hwdev == NULL || hwdev->dma_mask < 0xffffffff || no_iommu)
+        if (hwdev == NULL || hwdev->dma_mask < 0xffffffff)
                 gfp |= GFP_DMA;
 
+        dma_mask = hwdev ? hwdev->dma_mask : 0xffffffffULL;
+        if (dma_mask == 0)
+                dma_mask = 0xffffffffULL;
+
         /*
-         * First try to allocate continuous and use directly if already
-         * in lowmem.
+         * First try to allocate continuous and use directly if
+         * our device supports it
          */
         size = round_up(size, PAGE_SIZE);
+ again:
         memory = (void *)__get_free_pages(gfp, get_order(size));
         if (memory == NULL) {
                 return NULL;
         } else {
-                int high = 0, mmu;
-                if (((unsigned long)virt_to_bus(memory) + size) > 0xffffffffUL)
-                        high = 1;
-                mmu = high;
+                int high = (((unsigned long)virt_to_bus(memory) + size) & ~dma_mask) != 0;
+                int mmu = high;
                 if (force_mmu && !(gfp & GFP_DMA))
                         mmu = 1;
                 if (no_iommu) {
-                        if (high) goto error;
+                        if (high && (gfp & GFP_DMA))
+                                goto error;
+                        if (high) {
+                                free_pages((unsigned long)memory, get_order(size));
+                                gfp |= GFP_DMA;
+                                goto again;
+                        }
                         mmu = 0;
                 }
                 memset(memory, 0, size);

2005-03-22 21:56:21

by Suresh Siddha

Subject: Re: [PATCH 2.4.30-pre3] x86_64: pci_alloc_consistent() match 2.6 implementation

On Sat, Mar 19, 2005 at 04:17:51PM -0600, Matt Domsch wrote:
> OK, then how's this for review? Compiles clean, can't test it myself
> for a few days.
>
> -                int high = 0, mmu;
> -                if (((unsigned long)virt_to_bus(memory) + size) > 0xffffffffUL)
> -                        high = 1;
> -                mmu = high;
> +                int high = (((unsigned long)virt_to_bus(memory) + size) & ~dma_mask) != 0;
> +                int mmu = high;

Documentation/DMA-mapping.txt says the consistent DMA mapping interface will
always return a SAC-addressable DMA address. Your patch breaks this behavior.
(Though I don't know why this behavior is expected!)
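
For reference, SAC means single address cycle: the whole buffer must be
reachable through a 32-bit bus address. A minimal sketch of the documented
guarantee as a predicate (names are illustrative, not from DMA-mapping.txt):

#include <stdint.h>

/* Sketch: a consistent mapping is SAC-addressable iff the entire
 * buffer, not just its first byte, sits below 4GB on the bus. */
static int dma_is_sac(uint64_t dma_handle, uint64_t size)
{
        return size != 0 && (dma_handle + size - 1) <= 0xffffffffULL;
}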

Appended is a simple 2.4 patch which will sync the behavior with 2.6

thanks,
suresh
--

Sync 2.4 pci_alloc_consistent behavior with 2.6

Signed-off-by: Suresh Siddha <[email protected]>


diff -Nru linux-2.4.29/arch/ia64/lib/swiotlb.c linux-2.4.29-swiotlb/arch/ia64/lib/swiotlb.c
--- linux-2.4.29/arch/ia64/lib/swiotlb.c 2003-08-25 04:44:39.000000000 -0700
+++ linux-2.4.29-swiotlb/arch/ia64/lib/swiotlb.c 2005-03-22 10:51:21.968565920 -0800
@@ -50,13 +50,13 @@
  * Used to do a quick range check in swiotlb_unmap_single and swiotlb_sync_single, to see
  * if the memory was in fact allocated by this API.
  */
-static char *io_tlb_start, *io_tlb_end;
+char *io_tlb_start, *io_tlb_end;
 
 /*
  * The number of IO TLB blocks (in groups of 64) betweeen io_tlb_start and io_tlb_end.
  * This is command line adjustable via setup_io_tlb_npages.
  */
-static unsigned long io_tlb_nslabs = 1024;
+static unsigned long io_tlb_nslabs = 32768;
 
 /*
  * This is a free list describing the number of free entries available from each index
diff -Nru linux-2.4.29/arch/x86_64/kernel/pci-gart.c linux-2.4.29-swiotlb/arch/x86_64/kernel/pci-gart.c
--- linux-2.4.29/arch/x86_64/kernel/pci-gart.c 2004-08-07 16:26:04.000000000 -0700
+++ linux-2.4.29-swiotlb/arch/x86_64/kernel/pci-gart.c 2005-03-22 10:38:45.211610464 -0800
@@ -155,7 +155,7 @@
         int i;
         unsigned long iommu_page;
 
-        if (hwdev == NULL || hwdev->dma_mask < 0xffffffff || no_iommu)
+        if (hwdev == NULL || hwdev->dma_mask < 0xffffffff || (no_iommu && !swiotlb))
                 gfp |= GFP_DMA;
 
         /*
@@ -174,6 +174,22 @@
                 if (force_mmu && !(gfp & GFP_DMA))
                         mmu = 1;
                 if (no_iommu) {
+#ifdef CONFIG_SWIOTLB
+                        if (swiotlb && high && hwdev) {
+                                unsigned long dma_mask = 0;
+                                if (hwdev->dma_mask == ~0UL) {
+                                        hwdev->dma_mask = 0xffffffff;
+                                        dma_mask = ~0UL;
+                                }
+                                *dma_handle = swiotlb_map_single(hwdev, memory, size,
+                                                                 PCI_DMA_FROMDEVICE);
+                                if (dma_mask)
+                                        hwdev->dma_mask = dma_mask;
+                                memset(phys_to_virt(*dma_handle), 0, size);
+                                free_pages((unsigned long)memory, get_order(size));
+                                return phys_to_virt(*dma_handle);
+                        }
+#endif
                         if (high) goto error;
                         mmu = 0;
                 }
@@ -218,8 +234,16 @@
                               void *vaddr, dma_addr_t bus)
 {
         unsigned long iommu_page;
-
+        extern char *io_tlb_start, *io_tlb_end;
+
         size = round_up(size, PAGE_SIZE);
+#ifdef CONFIG_SWIOTLB
+        if (swiotlb && vaddr >= (void *)io_tlb_start &&
+            vaddr < (void *)io_tlb_end) {
+                swiotlb_unmap_single (hwdev, bus, size, PCI_DMA_TODEVICE);
+                return;
+        }
+#endif
         if (bus >= iommu_bus_base && bus < iommu_bus_base + iommu_size) {
                 unsigned pages = size >> PAGE_SHIFT;
                 iommu_page = (bus - iommu_bus_base) >> PAGE_SHIFT;
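
The free path above decides whether a buffer came from the swiotlb purely
by an address range test. Distilled into a stand-alone sketch (names
simplified from the patch; this is an illustration, not kernel code):

/* Sketch of the test pci_free_consistent() gains above: a consistent
 * buffer is a swiotlb bounce buffer iff its virtual address lies
 * inside the boot-time io_tlb area (which was allocated below 4GB). */
static int is_swiotlb_buffer(const char *vaddr,
                             const char *io_tlb_start,
                             const char *io_tlb_end)
{
        return vaddr >= io_tlb_start && vaddr < io_tlb_end;
}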