Hi,
While debugging DMA mapping error in chelsio crypto driver we observed that when scatter/gather list received by driver has some entry with page->offset > 4096 (PAGE_SIZE). It starts giving DMA error. Without IOMMU it works fine.
Before reaching to chelsio crypto driver(driver/crypto/chelsio) following entities change the sg'
1) IN esp_output() "__skb_to_sgvec()" convert skb frags to scatter gather list. At that moment sg->offset was 4094.
2) From esp_output control reaches to "crypto_authenc_encrypt()". Here in "scatterwalk_ffwd()" sg->offset become 4110.
3) Same sg list received by chelsio crypto driver(chcr). When chcr try to do DMA mapping it starts giving DMA errors.
Following error observed. first two prints are added for debugging in chcr. Kernel version used to reproduce is 4.9.28 on x86_64.
Sep 15 12:40:52 heptagon kernel: process_cipher req src ffff8803cb41f0a8
Sep 15 12:40:52 heptagon kernel: ========= issue hit offset:4110 ======= dma_addr f24b000e ==> DMA mapped address returned by dma_map_sg()
Sep 15 12:40:52 heptagon kernel: DMAR: DRHD: handling fault status reg 2
Sep 15 12:40:52 heptagon kernel: DMAR: [DMA Write] Request device [02:00.4] fault addr f24b0000 [fault reason 05] PTE Write access is not set
By applying following hack in kernel. Things start working.
diff --git a/crypto/scatterwalk.c b/crypto/scatterwalk.c
index c16c94f8..1d75a3a 100644
--- a/crypto/scatterwalk.c
+++ b/crypto/scatterwalk.c
@@ -78,6 +78,8 @@ struct scatterlist *scatterwalk_ffwd(struct scatterlist dst[2]
struct scatterlist *src,
unsigned int len)
{
+ unsigned int mod_page_offset;
+
for (;;) {
if (!len)
return src;
@@ -90,7 +92,9 @@ struct scatterlist *scatterwalk_ffwd(struct scatterlist dst[2]
}
sg_init_table(dst, 2);
- sg_set_page(dst, sg_page(src), src->length - len, src->offset + len);
+ mod_page_offset = (src->offset + len) / PAGE_SIZE;
+ sg_set_page(dst, sg_page(src) + mod_page_offset, src->length - len,
+ (src->offset + len) - (mod_page_offset * PAGE_SIZE));
scatterwalk_crypto_chain(dst, sg_next(src), 0, 2);
1) We are not expecting issue in "scatterwalk_ffwd" because it is not the only place where kernel
updates src->offset without checking page boundary. similar logic used in "__skb_to_sgvec".
2) It cannot be driver's responsibilty to update received sg entries to adjust offset and page
because we are not the only one who directly uses received sg list.
3) Since Without IOMMU every thing works fine. We are expecting IOMMU bugs.
Regards
Harsh Jain
Harsh Jain <[email protected]> wrote:
>
> While debugging DMA mapping error in chelsio crypto driver we observed that when scatter/gather list received by driver has some entry with page->offset > 4096 (PAGE_SIZE). It starts giving DMA error.? Without IOMMU it works fine.
This is not a bug. The network stack can and will feed us such
SG lists.
> 2) It cannot be driver's responsibilty to update received sg entries to adjust offset and page
> because we are not the only one who directly uses received sg list.
No the driver must deal with this. Having said that, if we can
improve our driver helper interface to make this easier then we
should do that too. What we certainly shouldn't do is to take a
whack-a-mole approach like this patch does.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On 20/09/17 09:01, Herbert Xu wrote:
> Harsh Jain <[email protected]> wrote:
>>
>> While debugging DMA mapping error in chelsio crypto driver we observed that when scatter/gather list received by driver has some entry with page->offset > 4096 (PAGE_SIZE). It starts giving DMA error. Without IOMMU it works fine.
>
> This is not a bug. The network stack can and will feed us such
> SG lists.
>
>> 2) It cannot be driver's responsibilty to update received sg entries to adjust offset and page
>> because we are not the only one who directly uses received sg list.
>
> No the driver must deal with this. Having said that, if we can
> improve our driver helper interface to make this easier then we
> should do that too. What we certainly shouldn't do is to take a
> whack-a-mole approach like this patch does.
AFAICS this is entirely on intel-iommu - from a brief look it appears
that all the IOVA calculations would handle the offset correctly, but
then __domain_mapping() blindly uses sg_page() for the physical address,
so if offset is larger than a page it would end up with the DMA mapping
covering the wrong part of the buffer.
Does the diff below help?
Robin.
----->8-----
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index b3914fce8254..2ed43d928135 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2253,7 +2253,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
sg_res = aligned_nrpages(sg->offset, sg->length);
sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
sg->dma_length = sg->length;
- pteval = page_to_phys(sg_page(sg)) | prot;
+ pteval = (sg_phys(sg) & PAGE_MASK) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}
On 20-09-2017 15:42, Robin Murphy wrote:
> On 20/09/17 09:01, Herbert Xu wrote:
>> Harsh Jain <[email protected]> wrote:
>>> While debugging DMA mapping error in chelsio crypto driver we observed that when scatter/gather list received by driver has some entry with page->offset > 4096 (PAGE_SIZE). It starts giving DMA error. Without IOMMU it works fine.
>> This is not a bug. The network stack can and will feed us such
>> SG lists.
>>
>>> 2) It cannot be driver's responsibilty to update received sg entries to adjust offset and page
>>> because we are not the only one who directly uses received sg list.
>> No the driver must deal with this. Having said that, if we can
>> improve our driver helper interface to make this easier then we
>> should do that too. What we certainly shouldn't do is to take a
>> whack-a-mole approach like this patch does.
> AFAICS this is entirely on intel-iommu - from a brief look it appears
> that all the IOVA calculations would handle the offset correctly, but
> then __domain_mapping() blindly uses sg_page() for the physical address,
> so if offset is larger than a page it would end up with the DMA mapping
> covering the wrong part of the buffer.
>
> Does the diff below help?
>
> Robin.
>
> ----->8-----
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index b3914fce8254..2ed43d928135 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2253,7 +2253,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> sg_res = aligned_nrpages(sg->offset, sg->length);
> sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> sg->dma_length = sg->length;
> - pteval = page_to_phys(sg_page(sg)) | prot;
> + pteval = (sg_phys(sg) & PAGE_MASK) | prot;
> phys_pfn = pteval >> VTD_PAGE_SHIFT;
> }
Robin,
Still having following error with above change.
[ 429.645492] DMAR: DRHD: handling fault status reg 2
[ 429.650847] DMAR: [DMA Write] Request device [02:00.4] fault addr f2682000 [t
>
On 20-09-2017 13:31, Herbert Xu wrote:
> Harsh Jain <[email protected]> wrote:
>> While debugging DMA mapping error in chelsio crypto driver we observed that when scatter/gather list received by driver has some entry with page->offset > 4096 (PAGE_SIZE). It starts giving DMA error. Without IOMMU it works fine.
> This is not a bug. The network stack can and will feed us such
> SG lists.
>
>> 2) It cannot be driver's responsibilty to update received sg entries to adjust offset and page
>> because we are not the only one who directly uses received sg list.
> No the driver must deal with this. Having said that, if we can
> improve our driver helper interface to make this easier then we
> should do that too. What we certainly shouldn't do is to take a
> whack-a-mole approach like this patch does.
Agreed,I added that patch for understanding purpose only. Today I referred other crypto driver for DMA related code. Most of them are using dma_map_sg except QAT. In QAT, They are first updating the Page address using offset then mapping each page in for loop with dma_map_single(0. I will try the same in chelsio driver will see the behavior.
>
> Cheers,
| From: Robin Murphy <[email protected]>
| Sent: Wednesday, September 20, 2017 3:12 AM
|
| On 20/09/17 09:01, Herbert Xu wrote:
| >
| > Harsh Jain <[email protected]> wrote:
| >>
| >> While debugging DMA mapping error in chelsio crypto driver we
| >> observed that when scatter/gather list received by driver has
| >> some entry with page->offset > 4096 (PAGE_SIZE). It starts
| >> giving DMA error. Without IOMMU it works fine.
| >
| > This is not a bug. The network stack can and will feed us such
| > SG lists.
| >
| >> 2) It cannot be driver's responsibilty to update received sg
| >> entries to adjust offset and page because we are not the only
| >> one who directly uses received sg list.
| >
| > No the driver must deal with this. Having said that, if we can
| > improve our driver helper interface to make this easier then we
| > should do that too. What we certainly shouldn't do is to take a
| > whack-a-mole approach like this patch does.
|
| AFAICS this is entirely on intel-iommu - from a brief look it appears
| that all the IOVA calculations would handle the offset correctly, but
| then __domain_mapping() blindly uses sg_page() for the physical address,
| so if offset is larger than a page it would end up with the DMA mapping
| covering the wrong part of the buffer.
|
| Does the diff below help?
|
| Robin.
|
| ----->8-----
| diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
| index b3914fce8254..2ed43d928135 100644
| --- a/drivers/iommu/intel-iommu.c
| +++ b/drivers/iommu/intel-iommu.c
| @@ -2253,7 +2253,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
| sg_res = aligned_nrpages(sg->offset, sg->length);
| sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
| sg->dma_length = sg->length;
| - pteval = page_to_phys(sg_page(sg)) | prot;
| + pteval = (sg_phys(sg) & PAGE_MASK) | prot;
| phys_pfn = pteval >> VTD_PAGE_SHIFT;
| }
Adding some likely people to the Cc list so they can comment on this.
Dan Williams submitted that specific piece of code in kernel.org:3e6110fd54
... but there are lots of similar bits in that function. Hopefully one of
the Intel I/O MMU Gurus will have a better idea of what may be going wrong
here. In the mean time I've asked our team to gather far more detailed
debug traces showing the exact Scatter/Gather Lists we're getting, what they
get translated to in the DMA Mappings, and what DMA Addresses were seeing in
error.
Casey
Hi Casey
Sorry, somehow didn't see this one come by.
On Mon, Sep 25, 2017 at 05:46:40PM +0000, Casey Leedom wrote:
> | From: Robin Murphy <[email protected]>
> | Sent: Wednesday, September 20, 2017 3:12 AM
> |
> | On 20/09/17 09:01, Herbert Xu wrote:
> | >
> | > Harsh Jain <[email protected]> wrote:
> | >>
> | >> While debugging DMA mapping error in chelsio crypto driver we
> | >> observed that when scatter/gather list received by driver has
> | >> some entry with page->offset > 4096 (PAGE_SIZE). It starts
> | >> giving DMA error. Without IOMMU it works fine.
Not sure how the page->offset would end up being greater than page-size?
If you have additional traces, please send them by.
Is this a new driver? wondering how we didn't run into this?
Cheers,
Ashok
On Wed, 2017-09-20 at 16:01 +0800, Herbert Xu wrote:
> Harsh Jain <[email protected]> wrote:
> >
> > While debugging DMA mapping error in chelsio crypto driver we
> observed that when scatter/gather list received by driver has some
> entry with page->offset > 4096 (PAGE_SIZE). It starts giving DMA
> error. Without IOMMU it works fine.
>
> This is not a bug. The network stack can and will feed us such
> SG lists.
Hm? Under what circumstances is the offset permitted to be >
PAGE_SIZE?
| From: Raj, Ashok <[email protected]>
| Sent: Monday, September 25, 2017 8:54 AM
|
| Not sure how the page->offset would end up being greater than page-size?
|
| If you have additional traces, please send them by.
|
| Is this a new driver? wondering how we didn't run into this?
According to Herbert Xu and one of our own engineers, it's actually legal
for Scatter/Gather Lists to have this. This isn't my area of expertise
though so I'm just passing that on.
I've asked our team to produce a detailed trace of the exact
Scatter/Gather Lists they're seeing and what ends up coming out of the DMA
Mappings, etc. They're in India, so I expect that they'll have this for you
by tomorrow morning.
Casey
On Mon, Sep 25, 2017 at 10:46 AM, Casey Leedom <[email protected]> wrote:
> | From: Robin Murphy <[email protected]>
> | Sent: Wednesday, September 20, 2017 3:12 AM
> |
> | On 20/09/17 09:01, Herbert Xu wrote:
> | >
> | > Harsh Jain <[email protected]> wrote:
> | >>
> | >> While debugging DMA mapping error in chelsio crypto driver we
> | >> observed that when scatter/gather list received by driver has
> | >> some entry with page->offset > 4096 (PAGE_SIZE). It starts
> | >> giving DMA error. Without IOMMU it works fine.
> | >
> | > This is not a bug. The network stack can and will feed us such
> | > SG lists.
> | >
> | >> 2) It cannot be driver's responsibilty to update received sg
> | >> entries to adjust offset and page because we are not the only
> | >> one who directly uses received sg list.
> | >
> | > No the driver must deal with this. Having said that, if we can
> | > improve our driver helper interface to make this easier then we
> | > should do that too. What we certainly shouldn't do is to take a
> | > whack-a-mole approach like this patch does.
> |
> | AFAICS this is entirely on intel-iommu - from a brief look it appears
> | that all the IOVA calculations would handle the offset correctly, but
> | then __domain_mapping() blindly uses sg_page() for the physical address,
> | so if offset is larger than a page it would end up with the DMA mapping
> | covering the wrong part of the buffer.
> |
> | Does the diff below help?
> |
> | Robin.
> |
> | ----->8-----
> | diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> | index b3914fce8254..2ed43d928135 100644
> | --- a/drivers/iommu/intel-iommu.c
> | +++ b/drivers/iommu/intel-iommu.c
> | @@ -2253,7 +2253,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> | sg_res = aligned_nrpages(sg->offset, sg->length);
> | sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> | sg->dma_length = sg->length;
> | - pteval = page_to_phys(sg_page(sg)) | prot;
> | + pteval = (sg_phys(sg) & PAGE_MASK) | prot;
This breaks on platforms where sizeof(phys_addr_t) > sizeof(unsigned
long). I.e. it's not always safe to assume that PAGE_MASK is the
correct width.
> | phys_pfn = pteval >> VTD_PAGE_SHIFT;
> | }
>
> Adding some likely people to the Cc list so they can comment on this.
> Dan Williams submitted that specific piece of code in kernel.org:3e6110fd54
> ... but there are lots of similar bits in that function. Hopefully one of
> the Intel I/O MMU Gurus will have a better idea of what may be going wrong
> here. In the mean time I've asked our team to gather far more detailed
> debug traces showing the exact Scatter/Gather Lists we're getting, what they
> get translated to in the DMA Mappings, and what DMA Addresses were seeing in
> error.
IIUC it looks like this has been broken ever since commit e1605495c716
"intel-iommu: Introduce domain_sg_mapping() to speed up
intel_map_sg()". I.e. it looks like the calculation for pte_val should
be:
pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;
| From: Dan Williams <[email protected]>
| Sent: Monday, September 25, 2017 12:31 PM
| ...
| IIUC it looks like this has been broken ever since commit e1605495c716
| "intel-iommu: Introduce domain_sg_mapping() to speed up
| intel_map_sg()". I.e. it looks like the calculation for pte_val should
| be:
|
| ?? pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;
Hhmmm, shouldn't that be:
pteval = (page_to_phys(sg_page(sg)) + (sg->offset>>PAGE_SHIFT)) | prot;
???
Casey
On Mon, Sep 25, 2017 at 1:05 PM, Casey Leedom <[email protected]> wrote:
> | From: Dan Williams <[email protected]>
> | Sent: Monday, September 25, 2017 12:31 PM
> | ...
> | IIUC it looks like this has been broken ever since commit e1605495c716
> | "intel-iommu: Introduce domain_sg_mapping() to speed up
> | intel_map_sg()". I.e. it looks like the calculation for pte_val should
> | be:
> |
> | pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;
>
> Hhmmm, shouldn't that be:
>
> pteval = (page_to_phys(sg_page(sg)) + (sg->offset>>PAGE_SHIFT)) | prot;
Yes, I think you're right. We do want to mask off the page-unaligned
portion of sg->offset.
| From: David Woodhouse <[email protected]>
| Sent: Monday, September 25, 2017 11:45 AM
|
| On Wed, 2017-09-20 at 16:01 +0800, Herbert Xu wrote:
| > Harsh Jain <[email protected]> wrote:
| > >
| > > While debugging DMA mapping error in chelsio crypto driver we
| > observed that when scatter/gather list received by driver has some
| > entry with page->offset > 4096 (PAGE_SIZE). It starts giving DMA
| > error. Without IOMMU it works fine.
| >
| > This is not a bug. The network stack can and will feed us such
| > SG lists.
|
| Hm? Under what circumstances is the offset permitted to be >
| PAGE_SIZE?
As I noted earlier, this is an area of the kernel with which I'm not super
familiar. Both Herbert Xu and our local VM Expert have said that having
Scatter/Gather Lists with Offsets greater than Page Size is not a bug ...
I'm mostly trying to help out keeping focus on this because Harsh is in
India (presumable enjoying a good night's sleep while we look at this.
Hopefully we'll have a present of a bug fix for him when he wakes up ... :-)
Casey
Hi
On Mon, Sep 25, 2017 at 01:11:04PM -0700, Dan Williams wrote:
> On Mon, Sep 25, 2017 at 1:05 PM, Casey Leedom <[email protected]> wrote:
> > | From: Dan Williams <[email protected]>
> > | Sent: Monday, September 25, 2017 12:31 PM
> > | ...
> > | IIUC it looks like this has been broken ever since commit e1605495c716
> > | "intel-iommu: Introduce domain_sg_mapping() to speed up
> > | intel_map_sg()". I.e. it looks like the calculation for pte_val should
> > | be:
> > |
> > | pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;
> >
> > Hhmmm, shouldn't that be:
> >
> > pteval = (page_to_phys(sg_page(sg)) + (sg->offset>>PAGE_SHIFT)) | prot;
>
> Yes, I think you're right. We do want to mask off the page-unaligned
> portion of sg->offset.
Shoulnd't we normalize the entire sg_page(sg) + sg_offset.
if when you only mask the page-unaligned portion i suspect you might be
pointing to a different region?
something like (sg_page(sg) + (sg->offset << VTD_PAGE_SHIFT))
then add the unaligned part.. sg->offset>>VTD_PAGE_SHIFT
Is this happening because you are using a 2M page? not sure what triggers
this or causes the driver to get passed in larger than 4K offset, or
running 32bit kernel?
if its legal to get passed in such odd values, we should fix IOMMU driver to
handle it properly, otherwise we should atleast fail those requests.
Cheers,
Ashok
| From: Raj, Ashok <[email protected]>
| Sent: Monday, September 25, 2017 12:03 PM
|
| On Mon, Sep 25, 2017 at 01:11:04PM -0700, Dan Williams wrote:
| > On Mon, Sep 25, 2017 at 1:05 PM, Casey Leedom <[email protected]> wrote:
| > > | From: Dan Williams <[email protected]>
| > > | Sent: Monday, September 25, 2017 12:31 PM
| > > | ...
| > > | IIUC it looks like this has been broken ever since commit e1605495c716
| > > | "intel-iommu: Introduce domain_sg_mapping() to speed up
| > > | intel_map_sg()". I.e. it looks like the calculation for pte_val should
| > > | be:
| > > |
| > > | pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;
| > >
| > > Hhmmm, shouldn't that be:
| > >
| > > pteval = (page_to_phys(sg_page(sg)) + (sg->offset>>PAGE_SHIFT)) | prot;
| >
| > Yes, I think you're right. We do want to mask off the page-unaligned
| > portion of sg->offset.
|
| Shoulnd't we normalize the entire sg_page(sg) + sg_offset.
|
| if when you only mask the page-unaligned portion i suspect you might be
| pointing to a different region?
|
| something like (sg_page(sg) + (sg->offset << VTD_PAGE_SHIFT))
|
| then add the unaligned part.. sg->offset>>VTD_PAGE_SHIFT
|
| Is this happening because you are using a 2M page? not sure what triggers
| this or causes the driver to get passed in larger than 4K offset, or
| running 32bit kernel?
|
| if its legal to get passed in such odd values, we should fix IOMMU driver to
| handle it properly, otherwise we should atleast fail those requests.
(woof) This is all above me. I've spent a chunk of time fruitlessly
trying to find documentation which says that scatterlist's are allowed to
have offset/length values which extend outside the sg_page(sg). So someone
much more familiar with this stuff is going to need to say what's allowed.
As I said, I've asked Harsh to provide us with a detailed trace of exactly
what he's seeing and what the Scatter/Gather Lists are getting translated
into. That information may make it easier to understand if/how
__domain_mapping() is screwing up ...
Casey
On 26-09-2017 00:16, Casey Leedom wrote:
> | From: Raj, Ashok <[email protected]>
> | Sent: Monday, September 25, 2017 8:54 AM
> |
> | Not sure how the page->offset would end up being greater than page-size?
Refer below
> |
> | If you have additional traces, please send them by.
> |
> | Is this a new driver? wondering how we didn't run into this?
>
> According to Herbert Xu and one of our own engineers, it's actually legal
> for Scatter/Gather Lists to have this. This isn't my area of expertise
> though so I'm just passing that on.
>
> I've asked our team to produce a detailed trace of the exact
> Scatter/Gather Lists they're seeing and what ends up coming out of the DMA
> Mappings, etc. They're in India, so I expect that they'll have this for you
> by tomorrow morning.
Below mentioned log was already there in 1st mail. Copied here for easy reference. Let me know if you need
additional traces.
1) IN esp_output() "__skb_to_sgvec()" convert skb frags to scatter gather list.
At that moment sg->offset was 4094.
2) From esp_output control reaches to "crypto_authenc_encrypt()". Here in
"scatterwalk_ffwd()" sg->offset become 4110.
3) Same sg list received by chelsio crypto driver(chcr). When chcr try to do
DMA mapping it starts giving DMA errors.
Following error observed. first two prints are added for debugging in chcr.
Kernel version used to reproduce is 4.9.28 on x86_64 with Page size 4K.
Sep 15 12:40:52 heptagon kernel: process_cipher req src ffff8803cb41f0a8
Sep 15 12:40:52 heptagon kernel: ========= issue hit offset:4110 =======
dma_addr f24b000e ==> DMA mapped address returned by dma_map_sg()
Sep 15 12:40:52 heptagon kernel: DMAR: DRHD: handling fault status reg 2
Sep 15 12:40:52 heptagon kernel: DMAR: [DMA Write] Request device [02:00.4]
fault addr f24b0000 [fault reason 05] PTE Write access is not set
>
> Casey
On 26-09-2017 00:15, David Woodhouse wrote:
> On Wed, 2017-09-20 at 16:01 +0800, Herbert Xu wrote:
>> Harsh Jain <[email protected]> wrote:
>>>
>>> While debugging DMA mapping error in chelsio crypto driver we
>> observed that when scatter/gather list received by driver has some
>> entry with page->offset > 4096 (PAGE_SIZE). It starts giving DMA
>> error. Without IOMMU it works fine.
>>
>> This is not a bug. The network stack can and will feed us such
>> SG lists.
> Hm? Under what circumstances is the offset permitted to be >
> PAGE_SIZE?
Its random, Kernel API's don't check offset value after arithmetic operations like in "__skb_to_sgvec()", "scatterwalk_ffwd()".
On 26-09-2017 01:41, Dan Williams wrote:
> On Mon, Sep 25, 2017 at 1:05 PM, Casey Leedom <[email protected]> wrote:
>> | From: Dan Williams <[email protected]>
>> | Sent: Monday, September 25, 2017 12:31 PM
>> | ...
>> | IIUC it looks like this has been broken ever since commit e1605495c716
>> | "intel-iommu: Introduce domain_sg_mapping() to speed up
>> | intel_map_sg()". I.e. it looks like the calculation for pte_val should
>> | be:
>> |
>> | pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;
>>
>> Hhmmm, shouldn't that be:
>>
>> pteval = (page_to_phys(sg_page(sg)) + (sg->offset>>PAGE_SHIFT)) | prot;
> Yes, I think you're right. We do want to mask off the page-unaligned
> portion of sg->offset.
Tried changing above line in "__domain_mapping" but didn't help.
On 26/09/17 13:21, Harsh Jain wrote:
> Find attached new set of log. After repeated tries it panics.
Thanks, that makes things a bit clearer - looks like fixing the physical
address/pteval calculation to not be off by a page in one direction wasn't
helping much because the returned DMA address is actually also off by a
page in the other direction, and thus overflowing past the allocated IOVA
into whoever else's mapping happened to be there; complete carnage ensues.
After another look through the intel_map_sg() path, here's my second (still
completely untested) guess at a possible fix.
Robin.
----->8-----
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..d7f7def81613 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
uint64_t tmp;
if (!sg_res) {
+ size_t off = sg->offset & ~PAGE_MASK;
+
sg_res = aligned_nrpages(sg->offset, sg->length);
- sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
+ sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + off;
sg->dma_length = sg->length;
- pteval = page_to_phys(sg_page(sg)) | prot;
+ pteval = (page_to_phys(sg_page(sg)) + sg->offset - off) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}
>
>
> On 26-09-2017 09:16, Harsh Jain wrote:
>> On 26-09-2017 00:16, Casey Leedom wrote:
>>> | From: Raj, Ashok <[email protected]>
>>> | Sent: Monday, September 25, 2017 8:54 AM
>>> |
>>> | Not sure how the page->offset would end up being greater than page-size?
>> Refer below
>>> |
>>> | If you have additional traces, please send them by.
>>> |
>>> | Is this a new driver? wondering how we didn't run into this?
>>>
>>> According to Herbert Xu and one of our own engineers, it's actually legal
>>> for Scatter/Gather Lists to have this. This isn't my area of expertise
>>> though so I'm just passing that on.
>>>
>>> I've asked our team to produce a detailed trace of the exact
>>> Scatter/Gather Lists they're seeing and what ends up coming out of the DMA
>>> Mappings, etc. They're in India, so I expect that they'll have this for you
>>> by tomorrow morning.
>> Below mentioned log was already there in 1st mail. Copied here for easy reference. Let me know if you need
>> additional traces.
>>
>> 1) IN esp_output() "__skb_to_sgvec()" convert skb frags to scatter gather list.
>> At that moment sg->offset was 4094.
>> 2) From esp_output control reaches to "crypto_authenc_encrypt()". Here in
>> "scatterwalk_ffwd()" sg->offset become 4110.
>> 3) Same sg list received by chelsio crypto driver(chcr). When chcr try to do
>> DMA mapping it starts giving DMA errors.
>>
>> Following error observed. first two prints are added for debugging in chcr.
>> Kernel version used to reproduce is 4.9.28 on x86_64 with Page size 4K.
>>
>> Sep 15 12:40:52 heptagon kernel: process_cipher req src ffff8803cb41f0a8
>> Sep 15 12:40:52 heptagon kernel: ========= issue hit offset:4110 =======
>> dma_addr f24b000e ==> DMA mapped address returned by dma_map_sg()
>>
>> Sep 15 12:40:52 heptagon kernel: DMAR: DRHD: handling fault status reg 2
>> Sep 15 12:40:52 heptagon kernel: DMAR: [DMA Write] Request device [02:00.4]
>> fault addr f24b0000 [fault reason 05] PTE Write access is not set
>>
>>> Casey
>
| From: Robin Murphy <[email protected]>
| Sent: Tuesday, September 26, 2017 7:22 AM
|
| On 26/09/17 13:21, Harsh Jain wrote:
| > Find attached new set of log. After repeated tries it panics.
|
| Thanks, that makes things a bit clearer - looks like fixing the physical
| address/pteval calculation to not be off by a page in one direction wasn't
| helping much because the returned DMA address is actually also off by a
| page in the other direction, and thus overflowing past the allocated IOVA
| into whoever else's mapping happened to be there; complete carnage ensues.
|
| After another look through the intel_map_sg() path, here's my second (still
| completely untested) guess at a possible fix.
|
| Robin.
|
| ----->8-----
| diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
| index 6784a05dd6b2..d7f7def81613 100644
| --- a/drivers/iommu/intel-iommu.c
| +++ b/drivers/iommu/intel-iommu.c
| @@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
| uint64_t tmp;
|
| if (!sg_res) {
| + size_t off = sg->offset & ~PAGE_MASK;
| +
| sg_res = aligned_nrpages(sg->offset, sg->length);
| - sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
| + sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + off;
| sg->dma_length = sg->length;
| - pteval = page_to_phys(sg_page(sg)) | prot;
| + pteval = (page_to_phys(sg_page(sg)) + sg->offset - off) | prot;
| phys_pfn = pteval >> VTD_PAGE_SHIFT;
| }
Thanks Robin. And thanks Harsh for sending the detailed trace logs. I'll
see if I can get this tested today. Harsh is probably headed towards bed,
but there may be sufficiently good instructions in our internal bug system
to reproduce the issue.
Regardless, it seems that you agree that there's an issue with the Intel
I/O MMU support code with regard to the legal values which a (struct
scatterlist) can take on? I still can't find any documentation for this
and, personally, I'm a bit baffled by a Page-oriented Scatter/Gather List
representation where [Offset, Offset+Length) can reside outside the Page.
If this is a bug in the Intel I/O MMU support code, then we can caveat
to our customers pending an official fix for the issue.
Casey
On Tue, Sep 26, 2017 at 9:06 AM, Casey Leedom <[email protected]> wrote:
> | From: Robin Murphy <[email protected]>
> | Sent: Tuesday, September 26, 2017 7:22 AM
>
> |
> | On 26/09/17 13:21, Harsh Jain wrote:
> | > Find attached new set of log. After repeated tries it panics.
> |
> | Thanks, that makes things a bit clearer - looks like fixing the physical
> | address/pteval calculation to not be off by a page in one direction wasn't
> | helping much because the returned DMA address is actually also off by a
> | page in the other direction, and thus overflowing past the allocated IOVA
> | into whoever else's mapping happened to be there; complete carnage ensues.
> |
> | After another look through the intel_map_sg() path, here's my second
> (still
> | completely untested) guess at a possible fix.
> |
> | Robin.
> |
> | ----->8-----
> | diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> | index 6784a05dd6b2..d7f7def81613 100644
> | --- a/drivers/iommu/intel-iommu.c
> | +++ b/drivers/iommu/intel-iommu.c
> | @@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain
> *domain, unsigned long iov_pfn,
> | uint64_t tmp;
> |
> | if (!sg_res) {
> | + size_t off = sg->offset & ~PAGE_MASK;
> | +
> | sg_res = aligned_nrpages(sg->offset, sg->length);
> | - sg->dma_address = ((dma_addr_t)iov_pfn <<
> VTD_PAGE_SHIFT) + sg->offset;
> | + sg->dma_address = ((dma_addr_t)iov_pfn <<
> VTD_PAGE_SHIFT) + off;
> | sg->dma_length = sg->length;
> | - pteval = page_to_phys(sg_page(sg)) | prot;
> | + pteval = (page_to_phys(sg_page(sg)) + sg->offset -
> off) | prot;
> | phys_pfn = pteval >> VTD_PAGE_SHIFT;
> | }
>
> Thanks Robin. And thanks Harsh for sending the detailed trace logs. I'll
> see if I can get this tested today. Harsh is probably headed towards bed,
> but there may be sufficiently good instructions in our internal bug system
> to reproduce the issue.
>
> Regardless, it seems that you agree that there's an issue with the Intel
> I/O MMU support code with regard to the legal values which a (struct
> scatterlist) can take on? I still can't find any documentation for this
> and, personally, I'm a bit baffled by a Page-oriented Scatter/Gather List
> representation where [Offset, Offset+Length) can reside outside the Page.
Consider the case where the page represents a huge page, then an
offset greater than PAGE_SIZE (up to HPAGE_SIZE) makes sense.
On Tue, Sep 26, 2017 at 03:22:47PM +0100, Robin Murphy wrote:
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 6784a05dd6b2..d7f7def81613 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> uint64_t tmp;
>
> if (!sg_res) {
> + size_t off = sg->offset & ~PAGE_MASK;
Should this be VTD_PAGE_MASK?
> +
> sg_res = aligned_nrpages(sg->offset, sg->length);
> - sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> + sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + off;
> sg->dma_length = sg->length;
> - pteval = page_to_phys(sg_page(sg)) | prot;
> + pteval = (page_to_phys(sg_page(sg)) + sg->offset - off) | prot;
Something seems wrong here.. sg->offset can be > VTD_PAGE_SIZE, think
we should add sg->offset and then find the pteval?
attached below another cut at fixing the same problem.. if there is something
obvious i missed, let me know.
again.. untested :-)
Cheers,
Ashok
Oops..minor typo.. VTD_PAGE_SHIFT instead of VTD_PAGE_MASK
On Tue, Sep 26, 2017 at 07:34:41AM -0700, Ashok Raj wrote:
> On Tue, Sep 26, 2017 at 03:22:47PM +0100, Robin Murphy wrote:
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index 6784a05dd6b2..d7f7def81613 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> > uint64_t tmp;
> >
> > if (!sg_res) {
> > + size_t off = sg->offset & ~PAGE_MASK;
>
> Should this be VTD_PAGE_MASK?
>
> > +
> > sg_res = aligned_nrpages(sg->offset, sg->length);
> > - sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> > + sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + off;
> > sg->dma_length = sg->length;
> > - pteval = page_to_phys(sg_page(sg)) | prot;
> > + pteval = (page_to_phys(sg_page(sg)) + sg->offset - off) | prot;
>
> Something seems wrong here.. sg->offset can be > VTD_PAGE_SIZE, think
> we should add sg->offset and then find the pteval?
>
> attached below another cut at fixing the same problem.. if there is something
> obvious i missed, let me know.
>
> again.. untested :-)
>
> Cheers,
> Ashok
>
> Sometimes offset can be greater than 4K. vt-d needs to account for that.
>
> From: Ashok Raj <[email protected]>
>
> Signed-off-by: Ashok Raj <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 6784a05..d43b566 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2254,10 +2254,13 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> uint64_t tmp;
>
> if (!sg_res) {
> + size_t off = sg->offset & ~VTD_PAGE_SHIFT;
> sg_res = aligned_nrpages(sg->offset, sg->length);
> - sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> + sg->dma_address = ((dma_addr_t)
> + (iov_pfn + sg->offset) << VTD_PAGE_SHIFT) + off;
> sg->dma_length = sg->length;
> - pteval = page_to_phys(sg_page(sg)) | prot;
> + pteval = (page_to_phys(sg_page(sg)) +
> + (sg->offset << VTD_PAGE_SHIFT)) | prot;
> phys_pfn = pteval >> VTD_PAGE_SHIFT;
> }
>
Sometimes offset can be greater than 4K. vt-d needs to account for that.
From: Ashok Raj <[email protected]>
Signed-off-by: Ashok Raj <[email protected]>
---
drivers/iommu/intel-iommu.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05..0333afe 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2254,10 +2254,13 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
uint64_t tmp;
if (!sg_res) {
+ size_t off = sg->offset & ~VTD_PAGE_MASK;
sg_res = aligned_nrpages(sg->offset, sg->length);
- sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
+ sg->dma_address = ((dma_addr_t)
+ (iov_pfn + sg->offset) << VTD_PAGE_SHIFT) + off;
sg->dma_length = sg->length;
- pteval = page_to_phys(sg_page(sg)) | prot;
+ pteval = (page_to_phys(sg_page(sg)) +
+ (sg->offset << VTD_PAGE_SHIFT)) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}
| From: Robin Murphy <[email protected]>
| Sent: Tuesday, September 26, 2017 7:22 AM
|
| On 26/09/17 13:21, Harsh Jain wrote:
| > Find attached new set of log. After repeated tries it panics.
|
| Thanks, that makes things a bit clearer - looks like fixing the physical
| address/pteval calculation to not be off by a page in one direction wasn't
| helping much because the returned DMA address is actually also off by a
| page in the other direction, and thus overflowing past the allocated IOVA
| into whoever else's mapping happened to be there; complete carnage ensues.
|
| After another look through the intel_map_sg() path, here's my second (still
| completely untested) guess at a possible fix.
|
| Robin.
|
| ----->8-----
| diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
| index 6784a05dd6b2..d7f7def81613 100644
| --- a/drivers/iommu/intel-iommu.c
| +++ b/drivers/iommu/intel-iommu.c
| @@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *doma!
| uint64_t tmp;
|
| if (!sg_res) {
| + size_t off = sg->offset & ~PAGE_MASK;
| +
| sg_res = aligned_nrpages(sg->offset, sg->length);
| - sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_S!
| + sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_S!
| sg->dma_length = sg->length;
| - pteval = page_to_phys(sg_page(sg)) | prot;
| + pteval = (page_to_phys(sg_page(sg)) + sg->offset - o!
| phys_pfn = pteval >> VTD_PAGE_SHIFT;
| }
Robin,
Harsh was able to do an initial test of you proposed patch above and his
test setup survived for 2 minutes with iperf traffic before his PEER went
belly up. Since his PEER isn't on a Remote Power IPMI, we'll have to wait
till tomorrow for further tests -- Harsh does have to sleep some time ...
:-) But the 2 minutes his test machine did survive were Very Promising!
And I see that Raj has also sent a couple of proposed fixes now.
It's just as well that Harsh is off to bed now so we can hash things out
while he sleeps ...
Casey
On 26/09/17 15:34, Raj, Ashok wrote:
> On Tue, Sep 26, 2017 at 03:22:47PM +0100, Robin Murphy wrote:
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index 6784a05dd6b2..d7f7def81613 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
>> uint64_t tmp;
>>
>> if (!sg_res) {
>> + size_t off = sg->offset & ~PAGE_MASK;
>
> Should this be VTD_PAGE_MASK?
PAGE_MASK (and the corresponding pteval arithmetic) was intentional
here; given the way aligned_nrpages() works, the IOVA space allocated in
intel_map_sg() (and thus iov_pfn) is already rounded to full MM pages,
and it seemed like the original intent was to map the whole lot - this
change is just to make that happen correctly.
Whether it's actually reasonable to decouple the IOMMU and CPU page
sizes entirely (as we do in dma-iommu, for example), and not do the
MM-page-alignment thing at all, is another matter that I'm happy to
leave in your hands :)
Robin.
>> +
>> sg_res = aligned_nrpages(sg->offset, sg->length);
>> - sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
>> + sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + off;
>> sg->dma_length = sg->length;
>> - pteval = page_to_phys(sg_page(sg)) | prot;
>> + pteval = (page_to_phys(sg_page(sg)) + sg->offset - off) | prot;
>
> Something seems wrong here.. sg->offset can be > VTD_PAGE_SIZE, think
> we should add sg->offset and then find the pteval?
>
> attached below another cut at fixing the same problem.. if there is something
> obvious i missed, let me know.
>
> again.. untested :-)
>
> Cheers,
> Ashok
>
So just to be 100% sure I understand the patch you're proposing, you got
the first use of VTD_PAGE_SHIFT wrong; it should have been VTD_PAGE_MASK? I.e.
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05..d43b566 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2254,10 +2254,13 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
uint64_t tmp;
if (!sg_res) {
+ size_t off = sg->offset & ~VTD_PAGE_MASK;
sg_res = aligned_nrpages(sg->offset, sg->length);
- sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
+ sg->dma_address = ((dma_addr_t)
+ (iov_pfn + sg->offset) << VTD_PAGE_SHIFT) + off;
sg->dma_length = sg->length;
- pteval = page_to_phys(sg_page(sg)) | prot;
+ pteval = (page_to_phys(sg_page(sg)) +
+ (sg->offset << VTD_PAGE_SHIFT)) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}
??? And I'm still confused about this portion:
- sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
+ sg->dma_address = ((dma_addr_t)
+ (iov_pfn + sg->offset) << VTD_PAGE_SHIFT) + off;
Isn't iov_pfn a Physical Page Number and we're adding a Byte Offset
to that? I would have though that it would be more like:
size_t page_off = sg->offset & ~VTD_PAGE_MASK;
unsigned long pfn_off = sg->offset >> VTD_PAGE_MASK;
...
sg->dma_address = ((dma_addr_t)
(iov_pfn + pfn_off) << VTD_PAGE_SHIFT) + page_off;
I want to be sure that Harsh has a concrete patch to work with when he
wakes up.
How about it Robin, Dan, David, Herbert, what do you guys think of Raj's
proposed patch?
Casey
| From: Dan Williams <[email protected]>
| Sent: Tuesday, September 26, 2017 9:10 AM
|
| On Tue, Sep 26, 2017 at 9:06 AM, Casey Leedom <[email protected]> wrote:
| > | From: Robin Murphy <[email protected]>
| > | Sent: Tuesday, September 26, 2017 7:22 AM
| > |...
| > ...
| > Regardless, it seems that you agree that there's an issue with the Intel
| > I/O MMU support code with regard to the legal values which a (struct
| > scatterlist) can take on? I still can't find any documentation for this
| > and, personally, I'm a bit baffled by a Page-oriented Scatter/Gather List
| > representation where [Offset, Offset+Length) can reside outside the Page.
|
| Consider the case where the page represents a huge page, then an
| offset greater than PAGE_SIZE (up to HPAGE_SIZE) makes sense.
Okay, but whatever the underlaying Page Size is, should [Offset,
Offset+Length) completely reside within the referenced Page? I'm just
trying to understand the Invariance Conditions which are assumed by all of
the code which processes Scatter/gather Lists ...
Casey
On Wed, Sep 27, 2017 at 9:31 AM, Casey Leedom <[email protected]> wrote:
> | From: Dan Williams <[email protected]>
> | Sent: Tuesday, September 26, 2017 9:10 AM
> |
> | On Tue, Sep 26, 2017 at 9:06 AM, Casey Leedom <[email protected]> wrote:
> | > | From: Robin Murphy <[email protected]>
> | > | Sent: Tuesday, September 26, 2017 7:22 AM
> | > |...
> | > ...
> | > Regardless, it seems that you agree that there's an issue with the Intel
> | > I/O MMU support code with regard to the legal values which a (struct
> | > scatterlist) can take on? I still can't find any documentation for this
> | > and, personally, I'm a bit baffled by a Page-oriented Scatter/Gather List
> | > representation where [Offset, Offset+Length) can reside outside the Page.
> |
> | Consider the case where the page represents a huge page, then an
> | offset greater than PAGE_SIZE (up to HPAGE_SIZE) makes sense.
>
> Okay, but whatever the underlaying Page Size is, should [Offset,
> Offset+Length) completely reside within the referenced Page? I'm just
> trying to understand the Invariance Conditions which are assumed by all of
> the code which processes Scatter/gather Lists ...
As far as I can see "Offset can be greater than PAGE_SIZE" is the only
safe assumption for core code.
On Wed, 27 Sep 2017 16:31:04 +0000
Casey Leedom <[email protected]> wrote:
> | From: Dan Williams <[email protected]>
> | Sent: Tuesday, September 26, 2017 9:10 AM
> |
> | On Tue, Sep 26, 2017 at 9:06 AM, Casey Leedom <[email protected]>
> wrote: | > | From: Robin Murphy <[email protected]>
> | > | Sent: Tuesday, September 26, 2017 7:22 AM
> | > |...
> | > ...
> | > Regardless, it seems that you agree that there's an issue with
> the Intel | > I/O MMU support code with regard to the legal values
> which a (struct | > scatterlist) can take on? I still can't find any
> documentation for this | > and, personally, I'm a bit baffled by a
> Page-oriented Scatter/Gather List | > representation where [Offset,
> Offset+Length) can reside outside the Page. |
> | Consider the case where the page represents a huge page, then an
> | offset greater than PAGE_SIZE (up to HPAGE_SIZE) makes sense.
>
> Okay, but whatever the underlaying Page Size is, should [Offset,
> Offset+Length) completely reside within the referenced Page? I'm just
> trying to understand the Invariance Conditions which are assumed by
> all of the code which processes Scatter/gather Lists ...
From my experience, in general terms each scatterlist segment
represents some contiguous quantity of pages, of which sg->page is the
first, while sg->length and sg->offset describe the specific bounds of
that segment's data. As such, the length may certainly (and frequently
does) exceed PAGE_SIZE; for the offset, it's unlikely that the producer
would initially construct one greater than PAGE_SIZE instead of just
pointing sg->page further forward, but it seems reasonable for it to
come about if some intermediate subsystem is processing an existing
list in-place (as seems to be the case with crypto here).
My opinion is that this may be a slightly unusual case, but I would
not consider it an illegal one. I think most DMA mapping
implementations would handle it whether intentionally or not.
Robin.
| From: Robin Murphy <[email protected]>
| Sent: Wednesday, September 27, 2017 10:18 AM
|
| From my experience, in general terms each scatterlist segment
| represents some contiguous quantity of pages, of which sg->page is the
| first, while sg->length and sg->offset describe the specific bounds of
| that segment's data. ...
Okay, thanks Robin. That'll help me in my reviews of your and Ashok's
suggested changes to the Intel I/O MMU __domain_mapping() routine.
Casey
Hi Robin
On Wed, Sep 27, 2017 at 06:18:02PM +0100, Robin Murphy wrote:
> On Wed, 27 Sep 2017 16:31:04 +0000
> Casey Leedom <[email protected]> wrote:
>
> > | From: Dan Williams <[email protected]>
> > | Sent: Tuesday, September 26, 2017 9:10 AM
> > |
> > | On Tue, Sep 26, 2017 at 9:06 AM, Casey Leedom <[email protected]>
> > wrote: | > | From: Robin Murphy <[email protected]>
> > | > | Sent: Tuesday, September 26, 2017 7:22 AM
> > | > |...
> > | > ...
> > | > Regardless, it seems that you agree that there's an issue with
> > the Intel | > I/O MMU support code with regard to the legal values
> > which a (struct | > scatterlist) can take on? I still can't find any
> > documentation for this | > and, personally, I'm a bit baffled by a
> > Page-oriented Scatter/Gather List | > representation where [Offset,
> > Offset+Length) can reside outside the Page. |
> > | Consider the case where the page represents a huge page, then an
> > | offset greater than PAGE_SIZE (up to HPAGE_SIZE) makes sense.
> >
> > Okay, but whatever the underlaying Page Size is, should [Offset,
> > Offset+Length) completely reside within the referenced Page? I'm just
> > trying to understand the Invariance Conditions which are assumed by
> > all of the code which processes Scatter/gather Lists ...
>
> From my experience, in general terms each scatterlist segment
> represents some contiguous quantity of pages, of which sg->page is the
> first, while sg->length and sg->offset describe the specific bounds of
> that segment's data. As such, the length may certainly (and frequently
> does) exceed PAGE_SIZE; for the offset, it's unlikely that the producer
> would initially construct one greater than PAGE_SIZE instead of just
> pointing sg->page further forward, but it seems reasonable for it to
> come about if some intermediate subsystem is processing an existing
> list in-place (as seems to be the case with crypto here).
>
> My opinion is that this may be a slightly unusual case, but I would
> not consider it an illegal one. I think most DMA mapping
> implementations would handle it whether intentionally or not.
In this specific case, it appears that
scatterwalk_ffwd()->sg_set_page()
sg_set_page(dst, sg_page(src), src->length - len, src->offset + len);
and
static inline void sg_set_page(struct scatterlist *sg, struct page *page,
unsigned int len, unsigned int offset)
{
sg_assign_page(sg, page);
sg->offset = offset;
sg->length = len;
}
The src->offset + len seems to be the culprit putting it past the page.
Looks like in the cases when it breaks, the offset is already towards
the end of page.. and adding the len, puts it over the limit.
When dealing with the offset > PAGE_SIZE, is the expectation you have another
additional entry for sgl? for e.g.
if sg->page = X, and offset=4092. and len = 16. Since IOMMU only understands
4K pages this last entry needs to be adjusted?
I'm not sure if the offset+len is a buffer overflow situation or just
trips IOMMU.
Cheers,
Ashok
the scatter gather list, should we
Hey Raj,
Let us know if you need help in gathering more debugging information. For
the time being we've decided to ERRATA the use of the Intel I/O MMU with
IPsec till we Root Cause the issue. But this is still at the top of Harsh's
bug list.
With Robin's comments, I'm almost sure that the:
(iov_pfn + sg->offset) << VTD_PAGE_SHIFT)
in your suggested patch is an issue. iov_pfn is a Page Frame Number and
sg->offset is a Byte Offset. It feels like this should be:
size_t page_off = sg->offset & ~VTD_PAGE_MASK;
unsigned long pfn_off = sg->offset >> VTD_PAGE_MASK;
...
sg->dma_address = ((dma_addr_t)
(iov_pfn + pfn_off) << VTD_PAGE_SHIFT) + page_off;
When Harsh tried your original patch, Harsh' test system wouldn't even boot.
Casey
Hi Casey
looking at the debug output i got from Harsh it still looks like
a bug in the code.
[ 538.284589] __domain_mapping nr_pages 0x1
[ 538.284600] __domain_mapping sg_res 0x1 sg->dma_address 0xf291000e dma len 0x38 pteval 0x3cbce3003 phys_pfn 0x3cbce3
[ 538.284604] chelsio driver - offset 4110 len 56 dma addr f291000e dma len 56
[ 538.284667] DMAR: DRHD: handling fault status reg 2
[ 538.290017] DMAR: [DMA Write] Request device [02:00.4] fault addr f2910000 [fault reason 05] PTE Write access is not set
somehow when crypto_authenc_encrypt() -> scatterwalk_ffwd()-> sg_set_page()
->sg_set_page(dst, sg_page(src), src->length - len, src->offset + len);
src->offset + len gets set as sg->offset in sg_set_page(). Either the
assumption that there should be room is incorrect, or some higher order crypto
code that ends up setting the offset did the wrong calculation.
if src->offset is already towards the end of the page, then offset+len will
go beyond the end of page.
On Wed, Sep 27, 2017 at 09:29:23PM +0000, Casey Leedom wrote:
> Hey Raj,
>
> Let us know if you need help in gathering more debugging information. For
> the time being we've decided to ERRATA the use of the Intel I/O MMU with
> IPsec till we Root Cause the issue. But this is still at the top of Harsh's
> bug list.
>
> With Robin's comments, I'm almost sure that the:
>
> (iov_pfn + sg->offset) << VTD_PAGE_SHIFT)
true, but this is the IOVA- IO Virtual address generated by the
dma_map call. Thought in cases when sg->offset is beyond a page, then
the new iov_pfn should fall on the next page. But we can't randomly adjust
here, unless IOMMU has also allocated IOVA for the page overflow.
Cheers,
Ashok
| From: Raj, Ashok <[email protected]>
| Sent: Wednesday, September 27, 2017 12:07 PM
|
| looking at the debug output i got from Harsh it still looks like a bug in
| the code.
|
| [ 538.284589] __domain_mapping nr_pages 0x1
| [ 538.284600] __domain_mapping sg_res 0x1 sg->dma_address 0xf291000e dma len
| 0x38 pteval 0x3cbce3003 phys_pfn 0x3cbce3
| [ 538.284604] chelsio driver - offset 4110 len 56 dma addr f291000e dma len
| 56
| [ 538.284667] DMAR: DRHD: handling fault status reg 2
| [ 538.290017] DMAR: [DMA Write] Request device [02:00.4] fault addr f2910000
| [fault reason 05] PTE Write access is not set
|
| somehow when crypto_authenc_encrypt() -> scatterwalk_ffwd()-> sg_set_page()
|
| ->sg_set_page(dst, sg_page(src), src->length - len, src->offset + len);
|
| src->offset + len gets set as sg->offset in sg_set_page(). Either the
| assumption that there should be room is incorrect, or some higher order
| crypto
| code that ends up setting the offset did the wrong calculation.
|
| if src->offset is already towards the end of the page, then offset+len will
| go beyond the end of page.
Hhmmm, it seems like we need Herbert to comment on this.
Herbert, is there any specific debugging information that you'd like to
see here?
Casey
On 28-09-2017 03:43, Casey Leedom wrote:
> | From: Raj, Ashok <[email protected]>
> | Sent: Wednesday, September 27, 2017 12:07 PM
> |
> | looking at the debug output i got from Harsh it still looks like a bug in
> | the code.
> |
> | [ 538.284589] __domain_mapping nr_pages 0x1
> | [ 538.284600] __domain_mapping sg_res 0x1 sg->dma_address 0xf291000e dma len
> | 0x38 pteval 0x3cbce3003 phys_pfn 0x3cbce3
> | [ 538.284604] chelsio driver - offset 4110 len 56 dma addr f291000e dma len
> | 56
> | [ 538.284667] DMAR: DRHD: handling fault status reg 2
> | [ 538.290017] DMAR: [DMA Write] Request device [02:00.4] fault addr f2910000
> | [fault reason 05] PTE Write access is not set
> |
> | somehow when crypto_authenc_encrypt() -> scatterwalk_ffwd()-> sg_set_page()
> |
> | ->sg_set_page(dst, sg_page(src), src->length - len, src->offset + len);
> |
> | src->offset + len gets set as sg->offset in sg_set_page(). Either the
> | assumption that there should be room is incorrect, or some higher order
> | crypto
Input received from user(Here XFRM) contains AAD(Additional Authentication data) || DATA(enc/dec) || Tag(hash). before passing input
to Cipher engine(chelsio) crypto_authenc_encrypt has to skip AAD which is 16 in our case. To skip that 16 bytes they simply incremented offset by 16.
I think Robin is right DMA mapping should handle it.
> | code that ends up setting the offset did the wrong calculation.
> |
> | if src->offset is already towards the end of the page, then offset+len will
> | go beyond the end of page.
>
> Hhmmm, it seems like we need Herbert to comment on this.
>
> Herbert, is there any specific debugging information that you'd like to
> see here?
>
> Casey
On Wed, Sep 27, 2017 at 10:13:04PM +0000, Casey Leedom wrote:
> | From: Raj, Ashok <[email protected]>
> | Sent: Wednesday, September 27, 2017 12:07 PM
> |
> | looking at the debug output i got from Harsh it still looks like a bug in
> | the code.
> |
> | [ 538.284589] __domain_mapping nr_pages 0x1
> | [ 538.284600] __domain_mapping sg_res 0x1 sg->dma_address 0xf291000e dma len
> | 0x38 pteval 0x3cbce3003 phys_pfn 0x3cbce3
> | [ 538.284604] chelsio driver - offset 4110 len 56 dma addr f291000e dma len
> | 56
> | [ 538.284667] DMAR: DRHD: handling fault status reg 2
> | [ 538.290017] DMAR: [DMA Write] Request device [02:00.4] fault addr f2910000
> | [fault reason 05] PTE Write access is not set
> |
> | somehow when crypto_authenc_encrypt() -> scatterwalk_ffwd()-> sg_set_page()
> |
> | ->sg_set_page(dst, sg_page(src), src->length - len, src->offset + len);
> |
> | src->offset + len gets set as sg->offset in sg_set_page(). Either the
> | assumption that there should be room is incorrect, or some higher order
> | crypto
> | code that ends up setting the offset did the wrong calculation.
> |
> | if src->offset is already towards the end of the page, then offset+len will
> | go beyond the end of page.
>
> Hhmmm, it seems like we need Herbert to comment on this.
>
> Herbert, is there any specific debugging information that you'd like to
> see here?
OK I was mistaken. While SG lists can contain entries that are
larger than PAGE_SIZE, there is no reason why scatterwalk_ffwd
should gratuitously insert a page_offset that is greater than
PAGE_SIZE.
Harsh, can you please submit your original patch with a sign-off?
Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On 28-09-2017 16:03, Herbert Xu wrote:
> On Wed, Sep 27, 2017 at 10:13:04PM +0000, Casey Leedom wrote:
>> | From: Raj, Ashok <[email protected]>
>> | Sent: Wednesday, September 27, 2017 12:07 PM
>> |
>> | looking at the debug output i got from Harsh it still looks like a bug in
>> | the code.
>> |
>> | [ 538.284589] __domain_mapping nr_pages 0x1
>> | [ 538.284600] __domain_mapping sg_res 0x1 sg->dma_address 0xf291000e dma len
>> | 0x38 pteval 0x3cbce3003 phys_pfn 0x3cbce3
>> | [ 538.284604] chelsio driver - offset 4110 len 56 dma addr f291000e dma len
>> | 56
>> | [ 538.284667] DMAR: DRHD: handling fault status reg 2
>> | [ 538.290017] DMAR: [DMA Write] Request device [02:00.4] fault addr f2910000
>> | [fault reason 05] PTE Write access is not set
>> |
>> | somehow when crypto_authenc_encrypt() -> scatterwalk_ffwd()-> sg_set_page()
>> |
>> | ->sg_set_page(dst, sg_page(src), src->length - len, src->offset + len);
>> |
>> | src->offset + len gets set as sg->offset in sg_set_page(). Either the
>> | assumption that there should be room is incorrect, or some higher order
>> | crypto
>> | code that ends up setting the offset did the wrong calculation.
>> |
>> | if src->offset is already towards the end of the page, then offset+len will
>> | go beyond the end of page.
>>
>> Hhmmm, it seems like we need Herbert to comment on this.
>>
>> Herbert, is there any specific debugging information that you'd like to
>> see here?
> OK I was mistaken. While SG lists can contain entries that are
> larger than PAGE_SIZE, there is no reason why scatterwalk_ffwd
> should gratuitously insert a page_offset that is greater than
> PAGE_SIZE.
>
> Harsh, can you please submit your original patch with a sign-off?
Herbert, There are some points which bothers me.
1) That patch solve panic later on after reporting the issue, I can see DMA_write error traces with iperf traffic.
2) It seems we(crypto) are not the only one who adjust offsets like this. refer "__skb_to_sgvec <http://elixir.free-electrons.com/linux/latest/ident/__skb_to_sgvec>" . There can be many more.
sg_set_page <http://elixir.free-electrons.com/linux/latest/ident/sg_set_page>(&sg[elt], skb_frag_page <http://elixir.free-electrons.com/linux/latest/ident/skb_frag_page>(frag <http://elixir.free-electrons.com/linux/latest/ident/frag>), copy <http://elixir.free-electrons.com/linux/latest/ident/copy>,
frag <http://elixir.free-electrons.com/linux/latest/ident/frag>->page_offset <http://elixir.free-electrons.com/linux/latest/ident/page_offset>+offset-start); 3) Yesterday Ashok suggested to run the test with boot parameter Intel_iommu=sp_off. With yhis Even Iperf traffic runs seemlessly.
Don't know what this has changed.
>
> Thanks,
On 28-09-2017 02:59, Casey Leedom wrote:
> Hey Raj,
>
> Let us know if you need help in gathering more debugging information. For
> the time being we've decided to ERRATA the use of the Intel I/O MMU with
> IPsec till we Root Cause the issue. But this is still at the top of Harsh's
> bug list.
>
> With Robin's comments, I'm almost sure that the:
>
> (iov_pfn + sg->offset) << VTD_PAGE_SHIFT)
>
> in your suggested patch is an issue. iov_pfn is a Page Frame Number and
> sg->offset is a Byte Offset. It feels like this should be:
>
> size_t page_off = sg->offset & ~VTD_PAGE_MASK;
> unsigned long pfn_off = sg->offset >> VTD_PAGE_MASK;
> ...
> sg->dma_address = ((dma_addr_t)
> (iov_pfn + pfn_off) << VTD_PAGE_SHIFT) + page_off;
>
> When Harsh tried your original patch, Harsh' test system wouldn't even boot.
Today I tried with "Intel_iommu=sp_off" boot option. Traffic runs without any error for more than 1 hrs. What magic this option did? :)
>
> Casey
Thanks for trying that Harsh.
sp_off turns of super page support. Which this mode, do you still see offsets greater than 4k?
On Thu, Sep 28, 2017 at 07:08:21PM +0530, Harsh Jain wrote:
>
>
> Today I tried with "Intel_iommu=sp_off" boot option. Traffic runs without any error for more than 1 hrs. What magic this option did? :)
Cheers,
Ashok
On 28-09-2017 18:35, Raj, Ashok wrote:
> Thanks for trying that Harsh.
>
> sp_off turns of super page support. Which this mode, do you still see offsets greater than 4k?
Yes, offset greater than 4k is still there. Refer below.
[56732.774872] offset 4110 len 76 dma addr 3a531200e dma len 76
[56732.804187] offset 4110 len 84 dma addr 3a63b200e dma len 84
[56732.805104] offset 4110 len 68 dma addr 3a531200e dma len 68
[56732.806870] offset 4110 len 56 dma addr 3a531200e dma len 56
[56732.808987] offset 4110 len 56 dma addr 3a531200e dma len 56
[56732.811215] offset 4110 len 56 dma addr 3a531200e dma len 56
[56732.813155] offset 4110 len 56 dma addr 3a531200e dma len 56
[56732.814823] offset 4110 len 56 dma addr 3a531200e dma len 56
[56732.816481] offset 4110 len 56 dma addr 3a531200e dma len 56
[56732.818159] offset 4110 len 56 dma addr 3a531200e dma len 56
[56732.819712] offset 4110 len 56 dma addr 3a531200e dma len 56
[56732.821629] offset 4110 len 56 dma addr 3a531200e dma len 56
[root@heptagon linux_t4_build]#
[root@heptagon linux_t4_build]#
[root@heptagon linux_t4_build]# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.9.51 root=UUID=ccbb7f18-b3f0-43df-89de-07521e9c02fe ro intel_iommu=sp_off crashkernel=auto rhgb quiet rhgb quiet console=ttyS0,115200, console=tty0 LANG=en_US.UTF-8
>
> On Thu, Sep 28, 2017 at 07:08:21PM +0530, Harsh Jain wrote:
>>
>> Today I tried with "Intel_iommu=sp_off" boot option. Traffic runs without any error for more than 1 hrs. What magic this option did? :)
> Cheers,
> Ashok
On Wed, Sep 27, 2017 at 10:13:51AM -0700, Dan Williams wrote:
> As far as I can see "Offset can be greater than PAGE_SIZE" is the only
> safe assumption for core code.
It seems completely bogus to me, but if it is the current assumption
we'll have to document it. But this brings me back to that
our scatterlists are a pretty horrible data structure to start
with as they try to mix virtual and physical addressing together.
We'd be much better of by passing a chain of bio_vecs where we
just need virtual addresses, a chain of [bus_addr,len] pairs where
we just need a physical address, and both where we need both instead
of this giant structure that tries to do both at the same time..