Hey Roland (and any iommu/ppc/dma experts out there):
I'm debugging a data corruption issue that happens on PPC64 systems
running rdma on kernels where the iommu page size is 4KB yet the host
page size is 64KB. This "feature" was added to the PPC64 code recently,
and is in kernel.org from 2.6.23. So if the kernel is built with a 4KB
page size, no problems. If the kernel is prior to 2.6.23 then 64KB page
configs work too. It's just a problem when the iommu page size != host
page size.
It appears that my problem boils down to a single host page of memory
that is mapped for dma, and the dma address returned by dma_map_sg() is
_not_ 64KB aligned. Here is an example:
app registers va 0x000000002d9a3000 len 12288
ib_umem_get() creates and maps a umem and chunk that looks like (dumping
state from a registered user memory region):
> umem len 12288 off 12288 pgsz 65536 shift 16
> chunk 0: nmap 1 nents 1
> sglist[0] page 0xc000000000930b08 off 0 len 65536 dma_addr 000000005bff4000 dma_len 65536
>
So the kernel maps 1 full page for this MR. But note that the dma
address is 000000005bff4000 which is 4KB aligned, not 64KB aligned. I
think this is causing grief to the RDMA HW.
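For reference, the driver builds its HW page list straight from that umem,
roughly like this (a paraphrased sketch of the usual user-MR registration
pattern, not any particular driver's exact code; pages[] is the PBL being
filled in):
	n = 0;
	list_for_each_entry(chunk, &umem->chunk_list, list)
		for (i = 0; i < chunk->nmap; ++i) {
			len = sg_dma_len(&chunk->page_list[i]) >> shift;
			for (j = 0; j < len; ++j)
				pages[n++] = sg_dma_address(&chunk->page_list[i]) +
					     (j << shift);
		}
	/* With the dump above: shift 16, dma_len 65536, so len = 1 and
	 * pages[0] = 000000005bff4000, which is what lands in the PBL. */
So whatever dma_map_sg() hands back goes straight into the adapter's page
list.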
My first question is: Is there an assumption or requirement in linux
that dma_addresses should have the same alignment as the host address
they are mapped to? I.e., the rdma core is mapping the entire 64KB page,
but the mapping doesn't begin on a 64KB page boundary.
If this mapping is considered valid, then perhaps the rdma hw is at
fault here. But I'm wondering if this is a PPC/iommu bug.
BTW: Here is what the Memory Region looks like to the HW:
> TPT entry: stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
> perms RW rem_inv_dis 0 addr_type VATO
> bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
> len 12288 va 000000002d9a3000 bind_cnt 0
> PBL: 000000005bff4000
Any thoughts?
Steve.
On Thu, 2007-12-20 at 11:14 -0600, Steve Wise wrote:
> Hey Roland (and any iommu/ppc/dma experts out there):
>
> I'm debugging a data corruption issue that happens on PPC64 systems
> running rdma on kernels where the iommu page size is 4KB yet the host
> page size is 64KB. This "feature" was added to the PPC64 code recently,
> and is in kernel.org from 2.6.23. So if the kernel is built with a 4KB
> page size, no problems. If the kernel is prior to 2.6.23 then 64KB page
> configs work too. It's just a problem when the iommu page size != host
> page size.
>
> It appears that my problem boils down to a single host page of memory
> that is mapped for dma, and the dma address returned by dma_map_sg() is
> _not_ 64KB aligned. Here is an example:
>
> app registers va 0x000000002d9a3000 len 12288
> ib_umem_get() creates and maps a umem and chunk that looks like (dumping
> state from a registered user memory region):
>
> > umem len 12288 off 12288 pgsz 65536 shift 16
> > chunk 0: nmap 1 nents 1
> > sglist[0] page 0xc000000000930b08 off 0 len 65536 dma_addr 000000005bff4000 dma_len 65536
> >
>
> So the kernel maps 1 full page for this MR. But note that the dma
> address is 000000005bff4000 which is 4KB aligned, not 64KB aligned. I
> think this is causing grief to the RDMA HW.
>
> My first question is: Is there an assumption or requirement in linux
> that dma_addresses should have the same alignment as the host address
> they are mapped to? I.e., the rdma core is mapping the entire 64KB page,
> but the mapping doesn't begin on a 64KB page boundary.
>
> If this mapping is considered valid, then perhaps the rdma hw is at
> fault here. But I'm wondering if this is a PPC/iommu bug.
>
> BTW: Here is what the Memory Region looks like to the HW:
>
> > TPT entry: stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
> > perms RW rem_inv_dis 0 addr_type VATO
> > bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
> > len 12288 va 000000002d9a3000 bind_cnt 0
> > PBL: 000000005bff4000
>
>
>
> Any thoughts?
The Ammasso certainly works this way. If you tell it the page size is
64KB, it will ignore bits in the page address that encode 0-65535.
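Concretely, with the numbers from Steve's dump (a rough worked example,
assuming his adapter masks the low pg_size bits the same way):
	pbl0 = 0x5bff4000;            /* PBL[0] from the dump    */
	base = pbl0 & ~0xffffUL;      /* HW would use 0x5bff0000 */
	va   = 0x2d9a3000;
	dma  = base + (va & 0xffff);  /* 0x5bff3000              */
The iommu actually mapped the page at 0x5bff4000, so the data really sits
at 0x5bff4000 + 0x3000 = 0x5bff7000; the adapter would be DMAing 16KB below
where it should, which would certainly look like data corruption.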
>
> Steve.
>
>
> It appears that my problem boils down to a single host page of memory
> that is mapped for dma, and the dma address returned by dma_map_sg()
> is _not_ 64KB aligned. Here is an example:
> My first question is: Is there an assumption or requirement in linux
> that dma_addresses should have the same alignment as the host address
> they are mapped to? I.e., the rdma core is mapping the entire 64KB page,
> but the mapping doesn't begin on a 64KB page boundary.
I don't think this is explicitly documented anywhere, but it certainly
seems that we want the bus address to be page-aligned in this case.
For mthca/mlx4 at least, we tell the adapter what the host page size
is (so that it knows how to align doorbell pages etc) and I think this
sort of thing would confuse the HW.
- R.
Roland Dreier wrote:
> > It appears that my problem boils down to a single host page of memory
> > that is mapped for dma, and the dma address returned by dma_map_sg()
> > is _not_ 64KB aligned. Here is an example:
>
> > My first question is: Is there an assumption or requirement in linux
> > that dma_addresses should have the same alignment as the host address
> > they are mapped to? I.e., the rdma core is mapping the entire 64KB page,
> > but the mapping doesn't begin on a 64KB page boundary.
>
> I don't think this is explicitly documented anywhere, but it certainly
> seems that we want the bus address to be page-aligned in this case.
> For mthca/mlx4 at least, we tell the adapter what the host page size
> is (so that it knows how to align doorbell pages etc) and I think this
> sort of thing would confuse the HW.
>
> - R.
In arch/powerpc/kernel/iommu.c:iommu_map_sg() I see that it calls
iommu_range_alloc() with an alignment_order of 0:
> vaddr = (unsigned long)page_address(s->page) + s->offset;
> npages = iommu_num_pages(vaddr, slen);
> entry = iommu_range_alloc(tbl, npages, &handle, mask >> IOMMU_PAGE_SHIFT, 0);
But perhaps the alignment order needs to be based on the host page size?
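Something like this, maybe (rough idea only, untested):
	align = 0;
	if (PAGE_SHIFT > IOMMU_PAGE_SHIFT)
		align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;	/* 16 - 12 = 4 */
	entry = iommu_range_alloc(tbl, npages, &handle,
				  mask >> IOMMU_PAGE_SHIFT, align);
That would make iommu_range_alloc() return ranges aligned to 2^4 = 16 iommu
pages, i.e. one 64KB host page.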
Steve.
Steve Wise wrote:
> Roland Dreier wrote:
>> > It appears that my problem boils down to a single host page of memory
>> > that is mapped for dma, and the dma address returned by dma_map_sg()
>> > is _not_ 64KB aligned. Here is an example:
>>
>> > My first question is: Is there an assumption or requirement in linux
>> > that dma_addresses should have the same alignment as the host address
>> > they are mapped to? I.e., the rdma core is mapping the entire 64KB page,
>> > but the mapping doesn't begin on a 64KB page boundary.
>>
>> I don't think this is explicitly documented anywhere, but it certainly
>> seems that we want the bus address to be page-aligned in this case.
>> For mthca/mlx4 at least, we tell the adapter what the host page size
>> is (so that it knows how to align doorbell pages etc) and I think this
>> sort of thing would confuse the HW.
>>
>> - R.
>
>
> In arch/powerpc/kernel/iommu.c:iommu_map_sg() I see that it calls
> iommu_range_alloc() with an alignment_order of 0:
>
>> vaddr = (unsigned long)page_address(s->page) + s->offset;
>> npages = iommu_num_pages(vaddr, slen);
>> entry = iommu_range_alloc(tbl, npages, &handle, mask >> IOMMU_PAGE_SHIFT, 0);
>
> But perhaps the alignment order needs to be based on the host page size?
>
Or based on the alignment of vaddr actually...
Adding a few more people to the discussion. You may well be right and we
would have to provide the same alignment, though that sucks a bit as one
of the reasons we switched to 4K for the IOMMU is that the iommu space
available on pSeries is very small and we were running out of it with
64K pages and lots of networking activity.
On Thu, 2007-12-20 at 11:14 -0600, Steve Wise wrote:
> Hey Roland (and any iommu/ppc/dma experts out there):
>
> I'm debugging a data corruption issue that happens on PPC64 systems
> running rdma on kernels where the iommu page size is 4KB yet the host
> page size is 64KB. This "feature" was added to the PPC64 code recently,
> and is in kernel.org from 2.6.23. So if the kernel is built with a 4KB
> page size, no problems. If the kernel is prior to 2.6.23 then 64KB page
> configs work too. It's just a problem when the iommu page size != host
> page size.
>
> It appears that my problem boils down to a single host page of memory
> that is mapped for dma, and the dma address returned by dma_map_sg() is
> _not_ 64KB aligned. Here is an example:
>
> app registers va 0x000000002d9a3000 len 12288
> ib_umem_get() creates and maps a umem and chunk that looks like (dumping
> state from a registered user memory region):
>
> > umem len 12288 off 12288 pgsz 65536 shift 16
> > chunk 0: nmap 1 nents 1
> > sglist[0] page 0xc000000000930b08 off 0 len 65536 dma_addr 000000005bff4000 dma_len 65536
> >
>
> So the kernel maps 1 full page for this MR. But note that the dma
> address is 000000005bff4000 which is 4KB aligned, not 64KB aligned. I
> think this is causing grief to the RDMA HW.
>
> My first question is: Is there an assumption or requirement in linux
> that dma_addresses should have the same alignment as the host address
> they are mapped to? I.e., the rdma core is mapping the entire 64KB page,
> but the mapping doesn't begin on a 64KB page boundary.
>
> If this mapping is considered valid, then perhaps the rdma hw is at
> fault here. But I'm wondering if this is a PPC/iommu bug.
>
> BTW: Here is what the Memory Region looks like to the HW:
>
> > TPT entry: stag idx 0x2e800 key 0xff state VAL type NSMR pdid 0x2
> > perms RW rem_inv_dis 0 addr_type VATO
> > bind_enable 1 pg_size 65536 qpid 0x0 pbl_addr 0x003c67c0
> > len 12288 va 000000002d9a3000 bind_cnt 0
> > PBL: 000000005bff4000
>
>
>
> Any thoughts?
>
> Steve.
>
On Thu, 2007-12-20 at 13:29 -0600, Steve Wise wrote:
> Or based on the alignment of vaddr actually...
The latter wouldn't be realistic. What I think might be necessary, though
it would definitely cause us problems with running out of iommu space
(which is the reason we did the switch down to 4K), is to provide
alignment to the real page size, and alignment to the allocation order
for dma_map_consistent.
It might be possible to -tweak- and only provide alignment to the page
size for allocations that are larger than IOMMU_PAGE_SIZE. That would
avoid the problem of small network packets eating up too much iommu
space, though.
What do you think ?
Ben.
Benjamin Herrenschmidt wrote:
> Adding a few more people to the discussion. You may well be right and we
> would have to provide the same alignment, though that sucks a bit as one
> of the reasons we switched to 4K for the IOMMU is that the iommu space
> available on pSeries is very small and we were running out of it with
> 64K pages and lots of networking activity.
>
But smarter NIC drivers could help here too, I think, perhaps by
carving up full pages of mapped buffers instead of just assuming mappings
are free...
perhaps...
Benjamin Herrenschmidt wrote:
> On Thu, 2007-12-20 at 13:29 -0600, Steve Wise wrote:
>
>> Or based on the alignment of vaddr actually...
>
> The latter wouldn't be realistic. What I think might be necessary, though
> it would definitely cause us problems with running out of iommu space
> (which is the reason we did the switch down to 4K), is to provide
> alignment to the real page size, and alignment to the allocation order
> for dma_map_consistent.
>
> It might be possible to -tweak- and only provide alignment to the page
> size for allocations that are larger than IOMMU_PAGE_SIZE. That would
> avoid the problem of small network packets eating up too much iommu
> space, though.
>
> What do you think ?
That might work.
If you gimme a patch, I'll try it out!
Steve.
On Thu, 2007-12-20 at 15:02 -0600, Steve Wise wrote:
> Benjamin Herrenschmidt wrote:
> > Adding a few more people to the discussion. You may well be right and we
> > would have to provide the same alignment, though that sucks a bit as one
> > of the reasons we switched to 4K for the IOMMU is that the iommu space
> > available on pSeries is very small and we were running out of it with
> > 64K pages and lots of networking activity.
> >
>
> But smarter NIC drivers could help here too, I think, perhaps by
> carving up full pages of mapped buffers instead of just assuming mappings
> are free...
True, but the problem still happens today: if we switch back to a 64K
iommu page size (which should be possible, I need to fix that), we
-will- run out of iommu space on typical workloads, and that is not
acceptable.
So we need to find a compromise.
What I might do is something along the lines of: if size >= PAGE_SIZE,
and vaddr (page_address + offset) is PAGE_SIZE aligned, then I enforce
alignment of the resulting mapping.
That should fix your case. Anything requesting smaller than PAGE_SIZE
mappings would lose that alignment but I -think- it should be safe, and
you still always get 4K alignment anyway (+/- your offset) so at least
small alignment restrictions are still enforced (such as cache line
alignment etc...).
I'll send you a test patch later today.
Ben.
Benjamin Herrenschmidt wrote:
> On Thu, 2007-12-20 at 15:02 -0600, Steve Wise wrote:
>> Benjamin Herrenschmidt wrote:
>>> Adding a few more people to the discussion. You may well be right and we
>>> would have to provide the same alignment, though that sucks a bit as one
>>> of the reasons we switched to 4K for the IOMMU is that the iommu space
>>> available on pSeries is very small and we were running out of it with
>>> 64K pages and lots of networking activity.
>>>
>> But smarter NIC drivers could help here too, I think, perhaps by
>> carving up full pages of mapped buffers instead of just assuming mappings
>> are free...
>
> True, but the problem still happens today: if we switch back to a 64K
> iommu page size (which should be possible, I need to fix that), we
> -will- run out of iommu space on typical workloads, and that is not
> acceptable.
>
> So we need to find a compromise.
>
> What I might do is something along the lines of: if size >= PAGE_SIZE,
> and vaddr (page_address + offset) is PAGE_SIZE aligned, then I enforce
> alignment of the resulting mapping.
>
> That should fix your case. Anything requesting smaller than PAGE_SIZE
> mappings would lose that alignment but I -think- it should be safe, and
> you still always get 4K alignment anyway (+/- your offset) so at least
> small alignment restrictions are still enforced (such as cache line
> alignment etc...).
>
> I'll send you a test patch later today.
>
> Ben.
>
Sounds good. Thanks!
Note that these smaller sub-host-page-sized mappings might pollute the
address space, causing fully aligned host-page-sized maps to become
scarce... Maybe there's a clever way to keep those in their own segment
of the address space?
> Sounds good. Thanks!
>
> Note that these smaller sub-host-page-sized mappings might pollute the
> address space, causing fully aligned host-page-sized maps to become
> scarce... Maybe there's a clever way to keep those in their own segment
> of the address space?
We already have a large vs. small split in the iommu virtual space to
alleviate this (though it's not a hard constraint, we can still get
into the "other" side if the default one is full).
Try that patch and let me know:
Index: linux-work/arch/powerpc/kernel/iommu.c
===================================================================
--- linux-work.orig/arch/powerpc/kernel/iommu.c 2007-12-21 10:39:39.000000000 +1100
+++ linux-work/arch/powerpc/kernel/iommu.c 2007-12-21 10:46:18.000000000 +1100
@@ -278,6 +278,7 @@ int iommu_map_sg(struct iommu_table *tbl
unsigned long flags;
struct scatterlist *s, *outs, *segstart;
int outcount, incount, i;
+ unsigned int align;
unsigned long handle;
BUG_ON(direction == DMA_NONE);
@@ -309,7 +310,11 @@ int iommu_map_sg(struct iommu_table *tbl
/* Allocate iommu entries for that segment */
vaddr = (unsigned long) sg_virt(s);
npages = iommu_num_pages(vaddr, slen);
- entry = iommu_range_alloc(tbl, npages, &handle, mask >> IOMMU_PAGE_SHIFT, 0);
+ align = 0;
+ if (IOMMU_PAGE_SHIFT < PAGE_SHIFT && (vaddr & ~PAGE_MASK) == 0)
+ align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
+ entry = iommu_range_alloc(tbl, npages, &handle,
+ mask >> IOMMU_PAGE_SHIFT, align);
DBG(" - vaddr: %lx, size: %lx\n", vaddr, slen);
@@ -572,7 +577,7 @@ dma_addr_t iommu_map_single(struct iommu
{
dma_addr_t dma_handle = DMA_ERROR_CODE;
unsigned long uaddr;
- unsigned int npages;
+ unsigned int npages, align;
BUG_ON(direction == DMA_NONE);
@@ -580,8 +585,13 @@ dma_addr_t iommu_map_single(struct iommu
npages = iommu_num_pages(uaddr, size);
if (tbl) {
+ align = 0;
+ if (IOMMU_PAGE_SHIFT < PAGE_SHIFT &&
+ ((unsigned long)vaddr & ~PAGE_MASK) == 0)
+ align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
+
dma_handle = iommu_alloc(tbl, vaddr, npages, direction,
- mask >> IOMMU_PAGE_SHIFT, 0);
+ mask >> IOMMU_PAGE_SHIFT, align);
if (dma_handle == DMA_ERROR_CODE) {
if (printk_ratelimit()) {
printk(KERN_INFO "iommu_alloc failed, "
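(If it helps, a quick sanity check on the driver side would be to walk the
list after dma_map_sg() and warn about anything that should have come back
host-page aligned but didn't. Untested sketch, not part of the patch; sgl
is the scatterlist and nents the count returned by dma_map_sg():)
	struct scatterlist *sg;
	int i;
	for_each_sg(sgl, sg, nents, i) {
		if (sg->offset == 0 && sg_dma_len(sg) >= PAGE_SIZE &&
		    (sg_dma_address(sg) & (PAGE_SIZE - 1)))
			printk(KERN_WARNING "unaligned dma mapping %llx len %u\n",
			       (unsigned long long)sg_dma_address(sg),
			       sg_dma_len(sg));
	}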
Benjamin Herrenschmidt wrote:
>> Sounds good. Thanks!
>>
>> Note, that these smaller sub-host-page-sized mappings might pollute the
>> address space causing full aligned host-page-size maps to become
>> scarce... Maybe there's a clever way to keep those in their own segment
>> of the address space?
>
> We already have a large vs. small split in the iommu virtual space to
> alleviate this (though it's not a hard constraint, we can still get
> into the "other" side if the default one is full).
>
> Try that patch and let me know:
Seems to be working!
:)
>
> Index: linux-work/arch/powerpc/kernel/iommu.c
> ===================================================================
> --- linux-work.orig/arch/powerpc/kernel/iommu.c 2007-12-21 10:39:39.000000000 +1100
> +++ linux-work/arch/powerpc/kernel/iommu.c 2007-12-21 10:46:18.000000000 +1100
> @@ -278,6 +278,7 @@ int iommu_map_sg(struct iommu_table *tbl
> unsigned long flags;
> struct scatterlist *s, *outs, *segstart;
> int outcount, incount, i;
> + unsigned int align;
> unsigned long handle;
>
> BUG_ON(direction == DMA_NONE);
> @@ -309,7 +310,11 @@ int iommu_map_sg(struct iommu_table *tbl
> /* Allocate iommu entries for that segment */
> vaddr = (unsigned long) sg_virt(s);
> npages = iommu_num_pages(vaddr, slen);
> - entry = iommu_range_alloc(tbl, npages, &handle, mask >> IOMMU_PAGE_SHIFT, 0);
> + align = 0;
> + if (IOMMU_PAGE_SHIFT < PAGE_SHIFT && (vaddr & ~PAGE_MASK) == 0)
> + align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
> + entry = iommu_range_alloc(tbl, npages, &handle,
> + mask >> IOMMU_PAGE_SHIFT, align);
>
> DBG(" - vaddr: %lx, size: %lx\n", vaddr, slen);
>
> @@ -572,7 +577,7 @@ dma_addr_t iommu_map_single(struct iommu
> {
> dma_addr_t dma_handle = DMA_ERROR_CODE;
> unsigned long uaddr;
> - unsigned int npages;
> + unsigned int npages, align;
>
> BUG_ON(direction == DMA_NONE);
>
> @@ -580,8 +585,13 @@ dma_addr_t iommu_map_single(struct iommu
> npages = iommu_num_pages(uaddr, size);
>
> if (tbl) {
> + align = 0;
> + if (IOMMU_PAGE_SHIFT < PAGE_SHIFT &&
> + ((unsigned long)vaddr & ~PAGE_MASK) == 0)
> + align = PAGE_SHIFT - IOMMU_PAGE_SHIFT;
> +
> dma_handle = iommu_alloc(tbl, vaddr, npages, direction,
> - mask >> IOMMU_PAGE_SHIFT, 0);
> + mask >> IOMMU_PAGE_SHIFT, align);
> if (dma_handle == DMA_ERROR_CODE) {
> if (printk_ratelimit()) {
> printk(KERN_INFO "iommu_alloc failed, "
>
BTW, I need to know urgently what HW is broken by this.
Ben.