The intel-iommu DMA ops fail to correctly handle scatterlists where
sg->offset is greater than PAGE_SIZE - the IOVA allocation is computed
appropriately based on the page-aligned portion of the offset, but the
mapping is set up relative to sg->page, which means it fails to actually
cover the whole buffer (and in the worst case doesn't cover it at all):
(sg->dma_address + sg->dma_len) ----+
sg->dma_address ---------+ |
iov_pfn------+ | |
| | |
v v v
iova: a b c d e f
|--------|--------|--------|--------|--------|
<...calculated....>
[_____mapped______]
pfn: 0 1 2 3 4 5
|--------|--------|--------|--------|--------|
^ ^ ^
| | |
sg->page ----+ | |
sg->offset --------------+ |
(sg->offset + sg->length) ----------+
As a result, the caller ends up overrunning the mapping into whatever
lies beyond, which usually goes badly:
[ 429.645492] DMAR: DRHD: handling fault status reg 2
[ 429.650847] DMAR: [DMA Write] Request device [02:00.4] fault addr f2682000 ...
Whilst this is a fairly rare occurrence, it can happen from the result
of intermediate scatterlist processing such as scatterwalk_ffwd() in the
crypto layer. Whilst that particular site could be fixed up, it still
seems worthwhile to bring intel-iommu in line with other DMA API
implementations in handling this robustly.
To that end, fix the intel_map_sg() path to line up the mapping
correctly (in units of MM pages rather than VT-d pages to match the
aligned_nrpages() calculation) regardless of the offset, and use
sg_phys() consistently for clarity.
Reported-by: Harsh Jain <[email protected]>
Signed-off-by: Robin Murphy <[email protected]>
---
drivers/iommu/intel-iommu.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 6784a05dd6b2..83f3d4831f94 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
uint64_t tmp;
if (!sg_res) {
+ unsigned int pgoff = sg->offset & ~PAGE_MASK;
+
sg_res = aligned_nrpages(sg->offset, sg->length);
- sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
+ sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + pgoff;
sg->dma_length = sg->length;
- pteval = page_to_phys(sg_page(sg)) | prot;
+ pteval = (sg_phys(sg) - pgoff) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}
@@ -3790,7 +3792,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
for_each_sg(sglist, sg, nelems, i) {
BUG_ON(!sg_page(sg));
- sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
+ sg->dma_address = sg_phys(sg);
sg->dma_length = sg->length;
}
return nelems;
--
2.13.4.dirty
Thanks Robin. Harsh can certainly test your latest patch as soon as he's
back in the office tomorrow morning India time. If your patch works and is
accepted, it sounds like the commit would be important enough to consider
backporting into various Long-Term Support releases and the affected
distributions. What's the procedure for nominating a commit for LTS inclusion?
Casey
Hi Casey
On Thu, Sep 28, 2017 at 04:17:59PM +0000, Casey Leedom wrote:
> Thanks Robin. Harsh can certainly test your latest patch as soon as he's
> back in the office tomorrow morning India time. If your patch works and is
> accepted, it sounds like the commit would be important enough to consider
> backporting into various Long-Term Support releases and the affected
> distributions. What's the procedure for nominating a commit for LTS inclusion?
its documented in Documentation/process/submitting-patches
I didn't see a new patch fly by.. Robin, could you send that over?
Cheers,
Ashok
On 28/09/17 14:29, Raj, Ashok wrote:
> Hi Casey
>
> On Thu, Sep 28, 2017 at 04:17:59PM +0000, Casey Leedom wrote:
>> Thanks Robin. Harsh can certainly test your latest patch as soon as he's
>> back in the office tomorrow morning India time. If your patch works and is
>> accepted, it sounds like the commit would be important enough to consider
>> backporting into various Long-Term Support releases and the affected
>> distributions. What's the procedure for nominating a commit for LTS inclusion?
I tend to leave stable decisions up to Joerg as the subsystem
maintainer, particularly when it's code outside my usual areas of
familiarity. FWIW, from a real dig through the history, the fragile
logic seems to date from the 2.6 days, having snuck in with b536d24d212c
("intel-iommu: Clean up intel_map_sg(), remove domain_page_mapping()")
> its documented in Documentation/process/submitting-patches
>
> I didn't see a new patch fly by.. Robin, could you send that over?
I hope our email server hasn't got blacklisted again... Said patch is
the top of this very thread we're replying on[1] - you were definitely
on cc :(
Robin.
[1]:https://lists.linuxfoundation.org/pipermail/iommu/2017-September/024371.html
Hi Robin
thanks.. i have no idea.. i see all the other patches from you :-)
my email has decided to play games with me i suppose :-)
On Thu, Sep 28, 2017 at 05:59:11PM +0100, Robin Murphy wrote:
> I hope our email server hasn't got blacklisted again... Said patch is
> the top of this very thread we're replying on[1] - you were definitely
> on cc :(
(sg->dma_address + sg->dma_len) ----+
sg->dma_address ---------+ |
iov_pfn------+ | |
| | |
v v v
iova: a b c d e f
|--------|--------|--------|--------|--------|
<...calculated....>
[_____mapped______]
pfn: 0 1 2 3 4 5
|--------|--------|--------|--------|--------|
^ ^ ^
| | |
sg->page ----+ | |
sg->offset --------------+ |
(sg->offset + sg->length) ----------+
The picture seems right. Looking at the code i'm not sure if i understand
it correctly.
pgoff = sg->offset & ~PAGE_MASK;
this gets the offset past the start of page.
sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + pgoff;
this would set dma_address at b+off instead of starting at c+off correct?
>
> Robin
>
> [1]:https://lists.linuxfoundation.org/pipermail/iommu/2017-September/024371.html
Robin,
I tried running patch on our test setup.
With "intel_iommu=on" : I can see single occurrence of DMAR Write failure on perf traffic with 10 thread.
[ 749.616480] perf: interrupt took too long (3203 > 3202), lowering kernel.perf_event_max_sample_rate to 62000
[ 852.500671] DMAR: DRHD: handling fault status reg 2
[ 852.506039] DMAR: [DMA Write] Request device [02:00.4] fault addr ef919000 [fault reason 05] PTE Write access is not set
[root@heptagon linux_t4_build]# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.9.51+ root=UUID=ccbb7f18-b3f0-43df-89de-07521e9c02fe ro intel_iommu=on crashkernel=auto rhgb quiet rhgb quiet console=ttyS0,115200, console=tty0 LANG=en_US.UTF-8
With intel_iommu=sp_off : It works fine for more than 30 minutes without any issues.
On 28-09-2017 19:44, Robin Murphy wrote:
> The intel-iommu DMA ops fail to correctly handle scatterlists where
> sg->offset is greater than PAGE_SIZE - the IOVA allocation is computed
> appropriately based on the page-aligned portion of the offset, but the
> mapping is set up relative to sg->page, which means it fails to actually
> cover the whole buffer (and in the worst case doesn't cover it at all):
>
> (sg->dma_address + sg->dma_len) ----+
> sg->dma_address ---------+ |
> iov_pfn------+ | |
> | | |
> v v v
> iova: a b c d e f
> |--------|--------|--------|--------|--------|
> <...calculated....>
> [_____mapped______]
> pfn: 0 1 2 3 4 5
> |--------|--------|--------|--------|--------|
> ^ ^ ^
> | | |
> sg->page ----+ | |
> sg->offset --------------+ |
> (sg->offset + sg->length) ----------+
>
> As a result, the caller ends up overrunning the mapping into whatever
> lies beyond, which usually goes badly:
>
> [ 429.645492] DMAR: DRHD: handling fault status reg 2
> [ 429.650847] DMAR: [DMA Write] Request device [02:00.4] fault addr f2682000 ...
>
> Whilst this is a fairly rare occurrence, it can happen from the result
> of intermediate scatterlist processing such as scatterwalk_ffwd() in the
> crypto layer. Whilst that particular site could be fixed up, it still
> seems worthwhile to bring intel-iommu in line with other DMA API
> implementations in handling this robustly.
>
> To that end, fix the intel_map_sg() path to line up the mapping
> correctly (in units of MM pages rather than VT-d pages to match the
> aligned_nrpages() calculation) regardless of the offset, and use
> sg_phys() consistently for clarity.
>
> Reported-by: Harsh Jain <[email protected]>
> Signed-off-by: Robin Murphy <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 6784a05dd6b2..83f3d4831f94 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> uint64_t tmp;
>
> if (!sg_res) {
> + unsigned int pgoff = sg->offset & ~PAGE_MASK;
> +
> sg_res = aligned_nrpages(sg->offset, sg->length);
> - sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> + sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + pgoff;
> sg->dma_length = sg->length;
> - pteval = page_to_phys(sg_page(sg)) | prot;
> + pteval = (sg_phys(sg) - pgoff) | prot;
> phys_pfn = pteval >> VTD_PAGE_SHIFT;
> }
>
> @@ -3790,7 +3792,7 @@ static int intel_nontranslate_map_sg(struct device *hddev,
>
> for_each_sg(sglist, sg, nelems, i) {
> BUG_ON(!sg_page(sg));
> - sg->dma_address = page_to_phys(sg_page(sg)) + sg->offset;
> + sg->dma_address = sg_phys(sg);
> sg->dma_length = sg->length;
> }
> return nelems;
| From: Harsh Jain <[email protected]>
| Sent: Friday, September 29, 2017 1:14:45 AM
|
| Robin,
|
| I tried running patch on our test setup.
|
| With "intel_iommu=on" : I can see single occurrence of DMAR Write failure
| on perf traffic with 10 thread.
|
| [ 749.616480] perf: interrupt took too long (3203 > 3202), lowering kernel.perf_event_max_sample_rate to 62000
| [ 852.500671] DMAR: DRHD: handling fault status reg 2
| [ 852.506039] DMAR: [DMA Write] Request device [02:00.4] fault addr ef919000 [fault reason 05] PTE Write access is not set
| [root@heptagon linux_t4_build]# cat /proc/cmdline
| BOOT_IMAGE=/vmlinuz-4.9.51+ root=UUID=ccbb7f18-b3f0-43df-89de-07521e9c02fe ro intel_iommu=on crashkernel=auto rhgb quiet rhgb quiet console=ttyS0,115200, console=tty0 LANG=en_US.UTF-8
Harsh. Can you provide the debugging information for that one DMA FAILURE
trace? It May be yet another corner case in __domain_mapping() or a
different path.
| With intel_iommu=sp_off : It works fine for more than 30 minutes without
| any issues.
I think that even without Robin's patch using intel_iommu=sp_off worked
without errors, right?
Casey
| From: Raj, Ashok <[email protected]>
| Sent: Friday, November 17, 2017 7:48 AM
|
| Reported by: Harsh <[email protected]>
| Reviewed by: Ashok Raj <[email protected]>
| Tested by: Jacob Pan <[email protected]>
Thanks everyone! I've updated our internal bug on this issue
and noted that we need to track down the remaining problems
which may be in our own code.
Casey
From 1584356276572923405@xxx Fri Nov 17 23:04:49 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread
Hi Alex
On Fri, Nov 17, 2017 at 09:18:14AM -0700, Alex Williamson wrote:
> On Thu, 16 Nov 2017 13:09:33 -0800
> "Raj, Ashok" <[email protected]> wrote:
>
> > >
> > > What do we do about this? I certainly can't rip out large page support
> > > and put a stable tag on the patch. I'm not really spotting what's
> > > wrong with large page support here, other than the comment about it
> > > being a mess. Suggestions? Thanks,
> > >
> >
> > Largepage seems to work and i don't think we need to rip it out. When
> > Harsh tested it at one point we thought disabling super-page seemed to make
> > the problem go away. Jacob tested and we still saw the need for Robin's patch.
> >
> > Yes, the function looks humongous but i don't think we should wait for that
> > before this merge.
>
> Ok. Who wants to toss in review and testing sign-offs? Clearly
> there's been a lot more eyes and effort on this patch than reflected in
> the original posting. I'll add a stable cc. Thanks,
Reported by: Harsh <[email protected]>
Reviewed by: Ashok Raj <[email protected]>
Tested by: Jacob Pan <[email protected]>
>
> Alex
From 1584348169955377162@xxx Fri Nov 17 20:55:57 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread
On Fri, 17 Nov 2017 17:44:57 +0000
Casey Leedom <[email protected]> wrote:
> | From: Raj, Ashok <[email protected]>
> | Sent: Friday, November 17, 2017 7:48 AM
> |
> | Reported by: Harsh <[email protected]>
> | Reviewed by: Ashok Raj <[email protected]>
> | Tested by: Jacob Pan <[email protected]>
>
> Thanks everyone! I've updated our internal bug on this issue
> and noted that we need to track down the remaining problems
> which may be in our own code.
>
All sounds good to me, let me know if you need further assistance on
vt-d driver.
Jacob
> Casey
From 1584344123774111177@xxx Fri Nov 17 19:51:39 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread
On Thu, 16 Nov 2017 13:09:33 -0800
"Raj, Ashok" <[email protected]> wrote:
> Hi Alex
>
> On Thu, Nov 16, 2017 at 02:32:44PM -0700, Alex Williamson wrote:
> > On Wed, 15 Nov 2017 15:54:56 -0800
> > Jacob Pan <[email protected]> wrote:
> >
> > > Hi Alex and all,
> > >
> > > Just wondering if you could merge Robin's patch for the next rc. From
> > > all our testing, this seems to be a solid fix and should be included in
> > > the stable releases as well.
> >
> > Hi Jacob,
> >
> > Sorry, this wasn't on my radar, I only scanned for patches back through
> > about when Joerg refreshed his next branch (others on the list speak up
> > if I didn't pickup your patches for the v4.15 merge window).
> >
> > This patch makes sense to me and I'm glad you were able to work through
> > the anomaly Harsh saw in testing as an unrelated issue, but...
> >
> >
> > What do we do about this? I certainly can't rip out large page support
> > and put a stable tag on the patch. I'm not really spotting what's
> > wrong with large page support here, other than the comment about it
> > being a mess. Suggestions? Thanks,
> >
>
> Largepage seems to work and i don't think we need to rip it out. When
> Harsh tested it at one point we thought disabling super-page seemed to make
> the problem go away. Jacob tested and we still saw the need for Robin's patch.
>
> Yes, the function looks humongous but i don't think we should wait for that
> before this merge.
Ok. Who wants to toss in review and testing sign-offs? Clearly
there's been a lot more eyes and effort on this patch than reflected in
the original posting. I'll add a stable cc. Thanks,
Alex
From 1584294273113614288@xxx Fri Nov 17 06:39:17 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread
Hi Alex
On Thu, Nov 16, 2017 at 02:32:44PM -0700, Alex Williamson wrote:
> On Wed, 15 Nov 2017 15:54:56 -0800
> Jacob Pan <[email protected]> wrote:
>
> > Hi Alex and all,
> >
> > Just wondering if you could merge Robin's patch for the next rc. From
> > all our testing, this seems to be a solid fix and should be included in
> > the stable releases as well.
>
> Hi Jacob,
>
> Sorry, this wasn't on my radar, I only scanned for patches back through
> about when Joerg refreshed his next branch (others on the list speak up
> if I didn't pickup your patches for the v4.15 merge window).
>
> This patch makes sense to me and I'm glad you were able to work through
> the anomaly Harsh saw in testing as an unrelated issue, but...
>
>
> What do we do about this? I certainly can't rip out large page support
> and put a stable tag on the patch. I'm not really spotting what's
> wrong with large page support here, other than the comment about it
> being a mess. Suggestions? Thanks,
>
Largepage seems to work and i don't think we need to rip it out. When
Harsh tested it at one point we thought disabling super-page seemed to make
the problem go away. Jacob tested and we still saw the need for Robin's patch.
Yes, the function looks humongous but i don't think we should wait for that
before this merge.
Cheers,
Ashok
From 1584273787997254500@xxx Fri Nov 17 01:13:41 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread
On Wed, 15 Nov 2017 15:54:56 -0800
Jacob Pan <[email protected]> wrote:
> Hi Alex and all,
>
> Just wondering if you could merge Robin's patch for the next rc. From
> all our testing, this seems to be a solid fix and should be included in
> the stable releases as well.
Hi Jacob,
Sorry, this wasn't on my radar, I only scanned for patches back through
about when Joerg refreshed his next branch (others on the list speak up
if I didn't pickup your patches for the v4.15 merge window).
This patch makes sense to me and I'm glad you were able to work through
the anomaly Harsh saw in testing as an unrelated issue, but...
> On Mon, 6 Nov 2017 10:47:09 -0800
> Jacob Pan <[email protected]> wrote:
>
> > On Fri, 6 Oct 2017 16:43:09 +0200
> > Joerg Roedel <[email protected]> wrote:
> >
> > > On Tue, Oct 03, 2017 at 07:05:17PM +0100, Robin Murphy wrote:
> > > > Now, there are indeed plenty of drivers and subsystems which do
> > > > work on lists of explicitly single pages - anything doing some
> > > > variant of "addr = kmap_atomic(sg_page(sg)) + sg->offset;" is easy
> > > > to spot - but I don't think DMA API implementations are in a
> > > > position to make any kind of assumption; nearly all of them just
> > > > shut up and handle sg->length bytes from sg_phys(sg) without
> > > > questioning the caller, and I reckon that's exactly what they
> > > > should be doing.
> > >
> > > I agree with that, it is not explicitly forbidden to have an
> > > sg->offset > PAGE_SIZE and most IOMMU drivers handle this case.
> > >
> > > So this is a problem I'd like to see resolved in the VT-d driver
> > > too. If nobody comes up with a correct fix soon I'll apply this one
> > > and rip out the large-page support from __domain_mapping() to make
> > > it work.
What do we do about this? I certainly can't rip out large page support
and put a stable tag on the patch. I'm not really spotting what's
wrong with large page support here, other than the comment about it
being a mess. Suggestions? Thanks,
Alex
From 1584182449587142028@xxx Thu Nov 16 01:01:54 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread
Hi Alex and all,
Just wondering if you could merge Robin's patch for the next rc. From
all our testing, this seems to be a solid fix and should be included in
the stable releases as well.
Thanks,
Jacob
On Mon, 6 Nov 2017 10:47:09 -0800
Jacob Pan <[email protected]> wrote:
> On Fri, 6 Oct 2017 16:43:09 +0200
> Joerg Roedel <[email protected]> wrote:
>
> > On Tue, Oct 03, 2017 at 07:05:17PM +0100, Robin Murphy wrote:
> > > Now, there are indeed plenty of drivers and subsystems which do
> > > work on lists of explicitly single pages - anything doing some
> > > variant of "addr = kmap_atomic(sg_page(sg)) + sg->offset;" is easy
> > > to spot - but I don't think DMA API implementations are in a
> > > position to make any kind of assumption; nearly all of them just
> > > shut up and handle sg->length bytes from sg_phys(sg) without
> > > questioning the caller, and I reckon that's exactly what they
> > > should be doing.
> >
> > I agree with that, it is not explicitly forbidden to have an
> > sg->offset > PAGE_SIZE and most IOMMU drivers handle this case.
> >
> > So this is a problem I'd like to see resolved in the VT-d driver
> > too. If nobody comes up with a correct fix soon I'll apply this one
> > and rip out the large-page support from __domain_mapping() to make
> > it work.
> >
> Hi All,
>
> Just to give an update on the offline debugging of this issue. With
> Robin's patch applied, I was able to reproduce the failure with
> similar configuration that Jain helped to set up.
>
> I added trace prints just to see the map/unmap activities leading to
> the DMAR fault. When fault occurs, the trace shows there is an unmap
> to the offending iova pfn. So I think this is a separate problem than
> Robin's patch is fixing. I think we should move forward to merge this
> patch upstream and stable. The remaining problem is likely a race
> condition between unmap and DMA activities.
>
> Here a brief extracted log, ee3d7 is the iova pfn in question.
> #1. map sg pfn ee3d7
> <idle>-0 [076] 74124.154254: bprint:
> __domain_mapping: vpfn:ee3d7, pgoff=2126, np:1, da:ee3d784e,
> len:1464 ,
> ppfn:1849c9c
>
> #2. unmap ee3d7000
> <idle>-0 [054] 74124.154301: bprint:
> intel_unmap: Device 0000:18:00.4 unmapping: pfn ee3d7-ee3d7
> <idle>-0 [076] 74124.154301: bprint:
> __domain_mapping: lvlpg:1, nrpg 0, vpfn:ec2ff, ppfn:183221a, sg_res:0
> <idle>-0 [059] 74124.154302: bprint:
> __domain_mapping: lvlpg:1, nrpg 0, vpfn:ee719, ppfn:c3e4dd, sg_res:0
> <idle>-0 [076] 74124.154302: bprint:
> __domain_mapping: vpfn:f183b, pgoff=78, np:1, da:f183b04e, len:1464,
>
> #3. DMA to unmapped address ee3d7000, DMAR fault raised.
> +2.952861] dmar_fault: 6 callbacks
> suppressed +0.000002] DMAR: DRHD: handling fault status reg
> 2 +0.005588] turning tracing
> off +0.003592] DMAR: [DMA Write] Request device [18:00.4] fault addr
> ee3d7000 [fault reason 05] PTE Write access is not set
>
> <idle>-0 [000] 74124.156906: bputs:
> 0xffffffffb259916bs: turning tracing off
>
>
> Thanks,
>
> Jacob
>
> > Speaking of __domain_mapping(), this function is a big
> > unmaintainable mess which should be split and rewritten. A clean
> > and maintainable rewrite can alse re-add the large-page support.
> >
> >
> > Regards,
> >
> > Joerg
> >
> > _______________________________________________
> > iommu mailing list
> > [email protected]
> > https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> [Jacob Pan]
[Jacob Pan]
From 1583343548761122495@xxx Mon Nov 06 18:47:56 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread
On Fri, 6 Oct 2017 16:43:09 +0200
Joerg Roedel <[email protected]> wrote:
> On Tue, Oct 03, 2017 at 07:05:17PM +0100, Robin Murphy wrote:
> > Now, there are indeed plenty of drivers and subsystems which do
> > work on lists of explicitly single pages - anything doing some
> > variant of "addr = kmap_atomic(sg_page(sg)) + sg->offset;" is easy
> > to spot - but I don't think DMA API implementations are in a
> > position to make any kind of assumption; nearly all of them just
> > shut up and handle sg->length bytes from sg_phys(sg) without
> > questioning the caller, and I reckon that's exactly what they
> > should be doing.
>
> I agree with that, it is not explicitly forbidden to have an
> sg->offset > PAGE_SIZE and most IOMMU drivers handle this case.
>
> So this is a problem I'd like to see resolved in the VT-d driver too.
> If nobody comes up with a correct fix soon I'll apply this one and
> rip out the large-page support from __domain_mapping() to make it
> work.
>
Hi All,
Just to give an update on the offline debugging of this issue. With
Robin's patch applied, I was able to reproduce the failure with
similar configuration that Jain helped to set up.
I added trace prints just to see the map/unmap activities leading to
the DMAR fault. When fault occurs, the trace shows there is an unmap to
the offending iova pfn. So I think this is a separate problem than
Robin's patch is fixing. I think we should move forward to merge this
patch upstream and stable. The remaining problem is likely a race
condition between unmap and DMA activities.
Here a brief extracted log, ee3d7 is the iova pfn in question.
#1. map sg pfn ee3d7
<idle>-0 [076] 74124.154254: bprint: __domain_mapping: vpfn:ee3d7, pgoff=2126, np:1, da:ee3d784e, len:1464
, ppfn:1849c9c
#2. unmap ee3d7000
<idle>-0 [054] 74124.154301: bprint: intel_unmap: Device 0000:18:00.4 unmapping: pfn ee3d7-ee3d7
<idle>-0 [076] 74124.154301: bprint: __domain_mapping: lvlpg:1, nrpg 0, vpfn:ec2ff, ppfn:183221a, sg_res:0
<idle>-0 [059] 74124.154302: bprint: __domain_mapping: lvlpg:1, nrpg 0, vpfn:ee719, ppfn:c3e4dd, sg_res:0
<idle>-0 [076] 74124.154302: bprint: __domain_mapping: vpfn:f183b, pgoff=78, np:1, da:f183b04e, len:1464,
#3. DMA to unmapped address ee3d7000, DMAR fault raised.
+2.952861] dmar_fault: 6 callbacks suppressed
+0.000002] DMAR: DRHD: handling fault status reg 2
+0.005588] turning tracing off
+0.003592] DMAR: [DMA Write] Request device [18:00.4] fault addr ee3d7000 [fault reason 05] PTE Write access is not set
<idle>-0 [000] 74124.156906: bputs:
0xffffffffb259916bs: turning tracing off
Thanks,
Jacob
> Speaking of __domain_mapping(), this function is a big unmaintainable
> mess which should be split and rewritten. A clean and maintainable
> rewrite can alse re-add the large-page support.
>
>
> Regards,
>
> Joerg
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
[Jacob Pan]
From 1580525593411860368@xxx Fri Oct 06 16:17:44 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums
On Fri, Oct 06, 2017 at 04:43:09PM +0200, Joerg Roedel wrote:
> On Tue, Oct 03, 2017 at 07:05:17PM +0100, Robin Murphy wrote:
> > Now, there are indeed plenty of drivers and subsystems which do work on
> > lists of explicitly single pages - anything doing some variant of
> > "addr = kmap_atomic(sg_page(sg)) + sg->offset;" is easy to spot - but I
> > don't think DMA API implementations are in a position to make any kind
> > of assumption; nearly all of them just shut up and handle sg->length
> > bytes from sg_phys(sg) without questioning the caller, and I reckon
> > that's exactly what they should be doing.
>
> I agree with that, it is not explicitly forbidden to have an
> sg->offset > PAGE_SIZE and most IOMMU drivers handle this case.
>
> So this is a problem I'd like to see resolved in the VT-d driver too. If
> nobody comes up with a correct fix soon I'll apply this one and rip out
> the large-page support from __domain_mapping() to make it work.
That seems like a good start. I have reviewed Robin's fix and it seems
to make sense. I'll start looking at making __domain_mapping()
to make it more manageable.
>
> Speaking of __domain_mapping(), this function is a big unmaintainable
> mess which should be split and rewritten. A clean and maintainable
> rewrite can alse re-add the large-page support.
>
>
> Regards,
>
> Joerg
>
From 1580519685599825798@xxx Fri Oct 06 14:43:50 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums
On Tue, Oct 03, 2017 at 07:05:17PM +0100, Robin Murphy wrote:
> Now, there are indeed plenty of drivers and subsystems which do work on
> lists of explicitly single pages - anything doing some variant of
> "addr = kmap_atomic(sg_page(sg)) + sg->offset;" is easy to spot - but I
> don't think DMA API implementations are in a position to make any kind
> of assumption; nearly all of them just shut up and handle sg->length
> bytes from sg_phys(sg) without questioning the caller, and I reckon
> that's exactly what they should be doing.
I agree with that, it is not explicitly forbidden to have an
sg->offset > PAGE_SIZE and most IOMMU drivers handle this case.
So this is a problem I'd like to see resolved in the VT-d driver too. If
nobody comes up with a correct fix soon I'll apply this one and rip out
the large-page support from __domain_mapping() to make it work.
Speaking of __domain_mapping(), this function is a big unmaintainable
mess which should be split and rewritten. A clean and maintainable
rewrite can alse re-add the large-page support.
Regards,
Joerg
From 1580325715455181105@xxx Wed Oct 04 11:20:46 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums
On 03/10/17 23:16, David Woodhouse wrote:
> On Tue, 2017-10-03 at 19:05 +0100, Robin Murphy wrote:
>>
>> Now, there are indeed plenty of drivers and subsystems which do work on
>> lists of explicitly single pages - anything doing some variant of
>> "addr = kmap_atomic(sg_page(sg)) + sg->offset;" is easy to spot - but I
>> don't think DMA API implementations are in a position to make any kind
>> of assumption; nearly all of them just shut up and handle sg->length
>> bytes from sg_phys(sg) without questioning the caller, and I reckon
>> that's exactly what they should be doing.
>
> So what's the point in sg->page in the first place? If even the
> *offset* can be greater than page size, it isn't even the *first* page
> (as you called it). Why aren't we just using a physical address,
> instead of an arbitrary page and an offset from that?
To nitpick, "the first of one or more contiguous pages" does not imply
"the first page of the buffer" - the buffer just lies *somewhere* within
that run of pages.
Historically, it looks like page+offset first came about in the 2.5 era
to support highmem[1] - note that scatterlist users still only really
care about virtual and DMA address, not physical. Sure, it wouldn't be
entirely impossible to use phys_addr_t instead, but at this point it
would be a massively invasive change, and would make implementations of
one DMA API callback a tiny bit simpler at the cost of pushing rather
more complexity out to hundreds of other users, some of whom might not
even call dma_map_sg().
The fact is, regardless of how much sense it does or doesn't make, a
fair amount of scatterlist-mangling code exists that is capable of
generating silly offsets, and it's so simple to make sure the DMA API
implementations don't care that I'd rather just keep on top of that than
go off on a crusade in attempt to wipe out the grey areas.
> Can we have *negative* sg->offset too? :)
unsigned int offset;
I'm gonna say no ;)
Robin.
[1]:https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/tree/include/asm-i386/scatterlist.h?h=v2.5.0
From 1580278521340246074@xxx Tue Oct 03 22:50:38 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums
Hi Robin
I now see your patch and it does seem to be fix the problem.
On Thu, Sep 28, 2017 at 08:43:46AM -0700, Ashok Raj wrote:
> Hi Robin
>
>
> On Thu, Sep 28, 2017 at 05:59:11PM +0100, Robin Murphy wrote:
> > I hope our email server hasn't got blacklisted again... Said patch is
> > the top of this very thread we're replying on[1] - you were definitely
> > on cc :(
>
> (sg->dma_address + sg->dma_len) ----+
> sg->dma_address ---------+ |
> iov_pfn------+ | |
> | | |
> v v v
> iova: a b c d e f
> |--------|--------|--------|--------|--------|
> <...calculated....>
> [_____mapped______]
> pfn: 0 1 2 3 4 5
> |--------|--------|--------|--------|--------|
> ^ ^ ^
> | | |
> sg->page ----+ | |
> sg->offset --------------+ |
> (sg->offset + sg->length) ----------+
>
> The picture seems right. Looking at the code i'm not sure if i understand
> it correctly.
>
> pgoff = sg->offset & ~PAGE_MASK;
>
> this gets the offset past the start of page.
>
> sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + pgoff;
>
> this would set dma_address at b+off instead of starting at c+off correct?
I assumed align_nrpages() would allocate 2 pages when
the offset is over PAGE_SIZE, but it seems economical and only
allocates 1 page to cover just the sg->length bytes.
the pteval also seems correct now, since you changed to
sg_phys() that already accounts for sg->offset, so you
subtract pgoff.
- pteval = page_to_phys(sg_page(sg)) | prot;
+ pteval = (sg_phys(sg) - pgoff) | prot;
Cheers,
Ashok
From 1580276803205734928@xxx Tue Oct 03 22:23:20 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums
| From: Harsh Jain <[email protected]>
| Sent: Tuesday, October 3, 2017 5:22 AM
|
| Hi Robin/Ashok,
|
| Find attached trace of DMA write error. I had a look on trace but didn't
| find anything suspicious.
|
| Let me know if you need more trace.
As a reminder, Harsh and Atul will be waking up in a few hours, so if there
are additional tests for which you'd like them to gather data, it would be
good to ask now so it's available to them to work on while we're all off
asleep ...
Casey
From 1580276425195801971@xxx Tue Oct 03 22:17:19 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums
On Tue, 2017-10-03 at 19:05 +0100, Robin Murphy wrote:
>
> Now, there are indeed plenty of drivers and subsystems which do work on
> lists of explicitly single pages - anything doing some variant of
> "addr = kmap_atomic(sg_page(sg)) + sg->offset;" is easy to spot - but I
> don't think DMA API implementations are in a position to make any kind
> of assumption; nearly all of them just shut up and handle sg->length
> bytes from sg_phys(sg) without questioning the caller, and I reckon
> that's exactly what they should be doing.
So what's the point in sg->page in the first place? If even the
*offset* can be greater than page size, it isn't even the *first* page
(as you called it). Why aren't we just using a physical address,
instead of an arbitrary page and an offset from that?
Can we have *negative* sg->offset too? :)
On 03/10/17 13:55, David Woodhouse wrote:
> On Thu, 2017-09-28 at 15:14 +0100, Robin Murphy wrote:
>> The intel-iommu DMA ops fail to correctly handle scatterlists where
>> sg->offset is greater than PAGE_SIZE - the IOVA allocation is computed
>> appropriately based on the page-aligned portion of the offset, but the
>> mapping is set up relative to sg->page, which means it fails to actually
>> cover the whole buffer (and in the worst case doesn't cover it at all):
>>
>> (sg->dma_address + sg->dma_len) ----+
>> sg->dma_address ---------+ |
>> iov_pfn------+ | |
>> | | |
>> v v v
>> iova: a b c d e f
>> |--------|--------|--------|--------|--------|
>> <...calculated....>
>> [_____mapped______]
>> pfn: 0 1 2 3 4 5
>> |--------|--------|--------|--------|--------|
>> ^ ^ ^
>> | | |
>> sg->page ----+ | |
>> sg->offset --------------+ |
>> (sg->offset + sg->length) ----------+
>
> I'd still dearly love to see some clear documentation of what it means
> for sg->offset to be outside the page referenced by sg->page.
I think the key is that for each SG segment, sg->page doesn't
necessarily represent "a" page, but the first of one or more contiguous
pages. Disregarding offsets for the moment, Here's a typical example of
a 120KB buffer from the block layer as processed by iommu_dma_map_sg():
[ 16.092649] == initial (4) ==
[ 16.095591] 0: virt ffff800001372000 phys 0x0000000081372000 dma 0x0000000000000000
[ 16.095591] offset 0x00000000 length 0x0000e000 dma_len 0x00000000
[ 16.109541] 1: virt ffff800001380000 phys 0x0000000081380000 dma 0x0000000000000000
[ 16.109541] offset 0x00000000 length 0x0000d000 dma_len 0x00000000
[ 16.123491] 2: virt ffff80000138e000 phys 0x000000008138e000 dma 0x0000000000000000
[ 16.123491] offset 0x00000000 length 0x00002000 dma_len 0x00000000
[ 16.137440] 3: virt ffff800001390000 phys 0x0000000081390000 dma 0x0000000000000000
[ 16.137440] offset 0x00000000 length 0x00001000 dma_len 0x00000000
[ 16.216167] == final (2) ==
[ 16.219106] 0: virt ffff800001372000 phys 0x0000000081372000 dma 0x00000000ffb60000
[ 16.219106] offset 0x00000000 length 0x0000e000 dma_len 0x0000e000
[ 16.233056] 1: virt ffff800001380000 phys 0x0000000081380000 dma 0x00000000ffb70000
[ 16.233056] offset 0x00000000 length 0x0000d000 dma_len 0x00010000
i.e. segments of 14 pages, 13 pages, 2 pages and 1 page respectively
(and we further merge the resulting DMA-contiguous segments on top of
that).
Now, there are indeed plenty of drivers and subsystems which do work on
lists of explicitly single pages - anything doing some variant of
"addr = kmap_atomic(sg_page(sg)) + sg->offset;" is easy to spot - but I
don't think DMA API implementations are in a position to make any kind
of assumption; nearly all of them just shut up and handle sg->length
bytes from sg_phys(sg) without questioning the caller, and I reckon
that's exactly what they should be doing.
> Or is it really not "outside", and it's *only* valid for the offset to
> be > PAGE_OFFSET when it's a huge page, so we can check that with a
> BUG_ON() ?
>
> In particular, I'd like to know what is intended in the Xen PV case,
> where there isn't a straight correspondence between pfn and mfn. Is the
> out-of-range sg->offset intended to refer to the next *pfn* after sg-
>> page, or to the next *mfn* after sg->page?
Logically, it should mean the same thing as whatever a length of more
than 1 page means to Xen - judging by blkif_queue_rw_req() at least,
that seems to be a BUG_ON() in both cases.
> I confess I've only followed this thread vaguely, but I haven't seen a
> *coherent* explanation except in the huge page case (in which case I
> want to see that BUG_ON in the patch) of why this isn't just totally
> bogus.
As I've said before, I'd certainly consider it a denormalised case, but
not a bogus one, and certainly not something that is the DMA API's job
to police. Having now audited every dma_map_ops::map_sg implementation I
could find, the only ones not using sg_phys()/sg_virt() or some other
construction immune to the absolute offset value (MIPS even explicitly
normalises it) are intel-iommu and arch/frv, and the latter is clearly
broken anyway as it ignores sg->length.
Robin.
From 1580241218132444280@xxx Tue Oct 03 12:57:43 +0000 2017
X-GM-THRID: 1579793157804914964
X-Gmail-Labels: Inbox,Category Forums
On Thu, 2017-09-28 at 15:14 +0100, Robin Murphy wrote:
> The intel-iommu DMA ops fail to correctly handle scatterlists where
> sg->offset is greater than PAGE_SIZE - the IOVA allocation is computed
> appropriately based on the page-aligned portion of the offset, but the
> mapping is set up relative to sg->page, which means it fails to actually
> cover the whole buffer (and in the worst case doesn't cover it at all):
>
> (sg->dma_address + sg->dma_len) ----+
> sg->dma_address ---------+ |
> iov_pfn------+ | |
> | | |
> v v v
> iova: a b c d e f
> |--------|--------|--------|--------|--------|
> <...calculated....>
> [_____mapped______]
> pfn: 0 1 2 3 4 5
> |--------|--------|--------|--------|--------|
> ^ ^ ^
> | | |
> sg->page ----+ | |
> sg->offset --------------+ |
> (sg->offset + sg->length) ----------+
I'd still dearly love to see some clear documentation of what it means
for sg->offset to be outside the page referenced by sg->page.
Or is it really not "outside", and it's *only* valid for the offset to
be > PAGE_OFFSET when it's a huge page, so we can check that with a
BUG_ON() ?
In particular, I'd like to know what is intended in the Xen PV case,
where there isn't a straight correspondence between pfn and mfn. Is the
out-of-range sg->offset intended to refer to the next *pfn* after sg-
>page, or to the next *mfn* after sg->page?
I confess I've only followed this thread vaguely, but I haven't seen a
*coherent* explanation except in the huge page case (in which case I
want to see that BUG_ON in the patch) of why this isn't just totally
bogus.