2023-06-07 22:55:28

by Alexander H Duyck

[permalink] [raw]
Subject: Question about reserved_regions w/ Intel IOMMU

I am running into a DMA issue that appears to be a conflict between
ACS and IOMMU. As per the documentation I can find, the IOMMU is
supposed to create reserved regions for MSI and the memory window
behind the root port. However looking at reserved_regions I am not
seeing that. I only see the reservation for the MSI.

So for example, with the NIC enabled and the iommu enabled w/o passthru, I am seeing:
# cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
0x00000000fee00000 0x00000000feefffff msi

Shouldn't there also be a memory window for the region behind the root
port to prevent any possible peer-to-peer access?


2023-06-07 23:21:24

by Alexander H Duyck

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
<[email protected]> wrote:
>
> I am running into a DMA issue that appears to be a conflict between
> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> supposed to create reserved regions for MSI and the memory window
> behind the root port. However looking at reserved_regions I am not
> seeing that. I only see the reservation for the MSI.
>
> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> 0x00000000fee00000 0x00000000feefffff msi
>
> Shouldn't there also be a memory window for the region behind the root
> port to prevent any possible peer-to-peer access?

Since the iommu portion of the email bounced I figured I would fix
that and provide some additional info.

I added some instrumentation to the kernel to dump the resources found
in iova_reserve_pci_windows(). From what I can tell it is finding the
correct resources for the Memory and Prefetchable regions behind the
root port. It seems to be calling reserve_iova(), which is successfully
allocating an iova to reserve the region.
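
For reference, the instrumentation is nothing fancy, just a print next
to the reserve_iova() call in iova_reserve_pci_windows(), roughly along
these lines (paraphrasing from memory rather than the exact patch):

	resource_list_for_each_entry(window, &bridge->windows) {
		if (resource_type(window->res) != IORESOURCE_MEM)
			continue;

		lo = iova_pfn(iovad, window->res->start - window->offset);
		hi = iova_pfn(iovad, window->res->end - window->offset);
		/* debug: dump the window being carved out of the IOVA space */
		dev_info(&dev->dev, "reserving IOVA pfn [0x%lx - 0x%lx] for %pR\n",
			 lo, hi, window->res);
		reserve_iova(iovad, lo, hi);
	}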

However still no luck on why it isn't showing up in reserved_regions.

2023-06-08 03:24:53

by Baolu Lu

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On 6/8/23 7:03 AM, Alexander Duyck wrote:
> On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> <[email protected]> wrote:
>>
>> I am running into a DMA issue that appears to be a conflict between
>> ACS and IOMMU. As per the documentation I can find, the IOMMU is
>> supposed to create reserved regions for MSI and the memory window
>> behind the root port. However looking at reserved_regions I am not
>> seeing that. I only see the reservation for the MSI.
>>
>> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
>> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
>> 0x00000000fee00000 0x00000000feefffff msi
>>
>> Shouldn't there also be a memory window for the region behind the root
>> port to prevent any possible peer-to-peer access?
>
> Since the iommu portion of the email bounced I figured I would fix
> that and provide some additional info.
>
> I added some instrumentation to the kernel to dump the resources found
> in iova_reserve_pci_windows. From what I can tell it is finding the
> correct resources for the Memory and Prefetchable regions behind the
> root port. It seems to be calling reserve_iova which is successfully
> allocating an iova to reserve the region.
>
> However still no luck on why it isn't showing up in reserved_regions.

Perhaps I can ask the opposite question: why should it show up in
reserved_regions? Why should the iommu subsystem block any possible
peer-to-peer DMA access? Isn't that a decision for the device driver?

The iova_reserve_pci_windows() you've seen is for the kernel DMA
interfaces and is not related to peer-to-peer accesses.

Best regards,
baolu

2023-06-08 14:54:59

by Alexander H Duyck

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <[email protected]> wrote:
>
> On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > <[email protected]> wrote:
> >>
> >> I am running into a DMA issue that appears to be a conflict between
> >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> >> supposed to create reserved regions for MSI and the memory window
> >> behind the root port. However looking at reserved_regions I am not
> >> seeing that. I only see the reservation for the MSI.
> >>
> >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> >> 0x00000000fee00000 0x00000000feefffff msi
> >>
> >> Shouldn't there also be a memory window for the region behind the root
> >> port to prevent any possible peer-to-peer access?
> >
> > Since the iommu portion of the email bounced I figured I would fix
> > that and provide some additional info.
> >
> > I added some instrumentation to the kernel to dump the resources found
> > in iova_reserve_pci_windows. From what I can tell it is finding the
> > correct resources for the Memory and Prefetchable regions behind the
> > root port. It seems to be calling reserve_iova which is successfully
> > allocating an iova to reserve the region.
> >
> > However still no luck on why it isn't showing up in reserved_regions.
>
> Perhaps I can ask the opposite question, why it should show up in
> reserve_regions? Why does the iommu subsystem block any possible peer-
> to-peer DMA access? Isn't that a decision of the device driver.
>
> The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> which is not related to peer-to-peer accesses.

The problem arises if the IOVA overlaps with the physical addresses of
other devices that can be routed to via ACS redirect. As such, if ACS
redirect is enabled, a host IOVA could be directed to another device on
the switch instead. To prevent that we need to reserve those addresses
to avoid address space collisions.

From what I can tell it looks like the IOVA should be reserved, but I
don't see it showing up anywhere in reserved_regions. What I am
wondering is if iova_reserve_pci_windows() should be taking some steps
so that it will appear, or if intel_iommu_get_resv_regions() needs to
have some code similar to iova_reserve_pci_windows() to get the ranges
and verify they are reserved in the IOVA.
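
Just to illustrate that second option, I was picturing something along
these lines in intel_iommu_get_resv_regions() (a completely untested
sketch, and the exact helpers may differ between kernel versions):

	if (dev_is_pci(device)) {
		struct pci_dev *pdev = to_pci_dev(device);
		struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
		struct resource_entry *window;

		resource_list_for_each_entry(window, &bridge->windows) {
			struct iommu_resv_region *resv;

			if (resource_type(window->res) != IORESOURCE_MEM)
				continue;

			/* expose the window behind the root port as reserved */
			resv = iommu_alloc_resv_region(window->res->start,
						       resource_size(window->res),
						       0, IOMMU_RESV_RESERVED);
			if (!resv)
				return;

			list_add_tail(&resv->list, head);
		}
	}

That would at least make the same ranges that iova_reserve_pci_windows()
carves out visible in the reserved_regions file.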

2023-06-08 15:45:18

by Robin Murphy

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On 2023-06-08 04:03, Baolu Lu wrote:
> On 6/8/23 7:03 AM, Alexander Duyck wrote:
>> On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
>> <[email protected]> wrote:
>>>
>>> I am running into a DMA issue that appears to be a conflict between
>>> ACS and IOMMU. As per the documentation I can find, the IOMMU is
>>> supposed to create reserved regions for MSI and the memory window
>>> behind the root port. However looking at reserved_regions I am not
>>> seeing that. I only see the reservation for the MSI.
>>>
>>> So for example with an enabled NIC and iommu enabled w/o passthru I
>>> am seeing:
>>> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
>>> 0x00000000fee00000 0x00000000feefffff msi
>>>
>>> Shouldn't there also be a memory window for the region behind the root
>>> port to prevent any possible peer-to-peer access?
>>
>> Since the iommu portion of the email bounced I figured I would fix
>> that and provide some additional info.
>>
>> I added some instrumentation to the kernel to dump the resources found
>> in iova_reserve_pci_windows. From what I can tell it is finding the
>> correct resources for the Memory and Prefetchable regions behind the
>> root port. It seems to be calling reserve_iova which is successfully
>> allocating an iova to reserve the region.
>>
>> However still no luck on why it isn't showing up in reserved_regions.
>
> Perhaps I can ask the opposite question, why it should show up in
> reserve_regions? Why does the iommu subsystem block any possible peer-
> to-peer DMA access? Isn't that a decision of the device driver.
>
> The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> which is not related to peer-to-peer accesses.

Right, in general the IOMMU driver cannot be held responsible for
whatever might happen upstream of the IOMMU input. The DMA layer carves
PCI windows out of its IOVA space unconditionally because we know that
they *might* be problematic, and we don't have any specific constraints
on our IOVA layout so it's no big deal to just sacrifice some space for
simplicity. We don't want to have to go digging any further into
bus-specific code to reason about whether the right ACS capabilities are
present and enabled everywhere to prevent direct P2P or not. Other
use-cases may have different requirements, though, so it's up to them
what they want to do.

It's conceptually pretty much the same as the case where the device (or
indeed a PCI host bridge or other interconnect segment in-between) has a
constrained DMA address width - the device may not be able to access all
of the address space that the IOMMU provides, but the IOMMU itself can't
tell you that.

Thanks,
Robin.

2023-06-08 15:45:36

by Ashok Raj

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <[email protected]> wrote:
> >
> > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > > <[email protected]> wrote:
> > >>
> > >> I am running into a DMA issue that appears to be a conflict between
> > >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> > >> supposed to create reserved regions for MSI and the memory window
> > >> behind the root port. However looking at reserved_regions I am not
> > >> seeing that. I only see the reservation for the MSI.
> > >>
> > >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> > >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> > >> 0x00000000fee00000 0x00000000feefffff msi
> > >>
> > >> Shouldn't there also be a memory window for the region behind the root
> > >> port to prevent any possible peer-to-peer access?
> > >
> > > Since the iommu portion of the email bounced I figured I would fix
> > > that and provide some additional info.
> > >
> > > I added some instrumentation to the kernel to dump the resources found
> > > in iova_reserve_pci_windows. From what I can tell it is finding the
> > > correct resources for the Memory and Prefetchable regions behind the
> > > root port. It seems to be calling reserve_iova which is successfully
> > > allocating an iova to reserve the region.
> > >
> > > However still no luck on why it isn't showing up in reserved_regions.
> >
> > Perhaps I can ask the opposite question, why it should show up in
> > reserve_regions? Why does the iommu subsystem block any possible peer-
> > to-peer DMA access? Isn't that a decision of the device driver.
> >
> > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > which is not related to peer-to-peer accesses.
>
> The problem is if the IOVA overlaps with the physical addresses of
> other devices that can be routed to via ACS redirect. As such if ACS
> redirect is enabled a host IOVA could be directed to another device on
> the switch instead. To prevent that we need to reserve those addresses
> to avoid address space collisions.

Any untranslated address from a device must be forwarded to the IOMMU when
ACS is enabled, correct? I guess if you want true p2p then you would need
to map so that the HPA turns into the peer address... but it's always a
round trip to the IOMMU.

>
> From what I can tell it looks like the IOVA should be reserved, but I
> don't see it showing up anywhere in reserved_regions. What I am
> wondering is if iova_reserve_pci_windows() should be taking some steps
> so that it will appear, or if intel_iommu_get_resv_regions() needs to
> have some code similar to iova_reserve_pci_windows() to get the ranges
> and verify they are reserved in the IOVA.
>

2023-06-08 17:28:07

by Alexander H Duyck

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <[email protected]> wrote:
>
> On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> > On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <[email protected]> wrote:
> > >
> > > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > > > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > > > <[email protected]> wrote:
> > > >>
> > > >> I am running into a DMA issue that appears to be a conflict between
> > > >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> > > >> supposed to create reserved regions for MSI and the memory window
> > > >> behind the root port. However looking at reserved_regions I am not
> > > >> seeing that. I only see the reservation for the MSI.
> > > >>
> > > >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> > > >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> > > >> 0x00000000fee00000 0x00000000feefffff msi
> > > >>
> > > >> Shouldn't there also be a memory window for the region behind the root
> > > >> port to prevent any possible peer-to-peer access?
> > > >
> > > > Since the iommu portion of the email bounced I figured I would fix
> > > > that and provide some additional info.
> > > >
> > > > I added some instrumentation to the kernel to dump the resources found
> > > > in iova_reserve_pci_windows. From what I can tell it is finding the
> > > > correct resources for the Memory and Prefetchable regions behind the
> > > > root port. It seems to be calling reserve_iova which is successfully
> > > > allocating an iova to reserve the region.
> > > >
> > > > However still no luck on why it isn't showing up in reserved_regions.
> > >
> > > Perhaps I can ask the opposite question, why it should show up in
> > > reserve_regions? Why does the iommu subsystem block any possible peer-
> > > to-peer DMA access? Isn't that a decision of the device driver.
> > >
> > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > which is not related to peer-to-peer accesses.
> >
> > The problem is if the IOVA overlaps with the physical addresses of
> > other devices that can be routed to via ACS redirect. As such if ACS
> > redirect is enabled a host IOVA could be directed to another device on
> > the switch instead. To prevent that we need to reserve those addresses
> > to avoid address space collisions.

Our test case is just to perform DMA to/from the host on one device on
a switch, and what we are seeing is that when we hit an IOVA that
matches up with the physical address of the neighboring device's BAR0
we see an AER followed by a hot reset.

> Any untranslated address from a device must be forwarded to the IOMMU when
> ACS is enabled correct? I guess if you want true p2p, then you would need
> to map so that the hpa turns into the peer address.. but its always a round
> trip to IOMMU.

This assumes all parts are doing the Request Redirect "correctly". In
our case there is a PCIe switch we are trying to debug and we have a
few working theories. One concern I have is that the switch may be
throwing an ACS violation for us using an address that matches a
neighboring device, instead of redirecting the request to the upstream
port. If we pull the switch and just run on the root complex the issue
seems to be resolved, so I started poking into the code, which led me
to the documentation pointing out what is supposed to be reserved based
on the root complex and MSI regions.

As a part of going down that rabbit hole I realized that the
reserved_regions file seems to only list the MSI reservation. However,
after digging a bit deeper, it seems like there is code to reserve the
memory behind the root complex in the IOVA, but that reservation doesn't
appear to be visible anywhere, and that is the piece I am currently
trying to sort out. What I am working on is figuring out whether the
system that is failing is actually reserving that memory region in the
IOVA, or whether that is somehow not happening in our test setup.

2023-06-08 17:59:58

by Raj, Ashok

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Thu, Jun 08, 2023 at 10:10:54AM -0700, Alexander Duyck wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <[email protected]> wrote:
> >
> > On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> > > On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <[email protected]> wrote:
> > > >
> > > > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > > > > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > > > > <[email protected]> wrote:
> > > > >>
> > > > >> I am running into a DMA issue that appears to be a conflict between
> > > > >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> > > > >> supposed to create reserved regions for MSI and the memory window
> > > > >> behind the root port. However looking at reserved_regions I am not
> > > > >> seeing that. I only see the reservation for the MSI.
> > > > >>
> > > > >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> > > > >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> > > > >> 0x00000000fee00000 0x00000000feefffff msi
> > > > >>
> > > > >> Shouldn't there also be a memory window for the region behind the root
> > > > >> port to prevent any possible peer-to-peer access?
> > > > >
> > > > > Since the iommu portion of the email bounced I figured I would fix
> > > > > that and provide some additional info.
> > > > >
> > > > > I added some instrumentation to the kernel to dump the resources found
> > > > > in iova_reserve_pci_windows. From what I can tell it is finding the
> > > > > correct resources for the Memory and Prefetchable regions behind the
> > > > > root port. It seems to be calling reserve_iova which is successfully
> > > > > allocating an iova to reserve the region.
> > > > >
> > > > > However still no luck on why it isn't showing up in reserved_regions.
> > > >
> > > > Perhaps I can ask the opposite question, why it should show up in
> > > > reserve_regions? Why does the iommu subsystem block any possible peer-
> > > > to-peer DMA access? Isn't that a decision of the device driver.
> > > >
> > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > > which is not related to peer-to-peer accesses.
> > >
> > > The problem is if the IOVA overlaps with the physical addresses of
> > > other devices that can be routed to via ACS redirect. As such if ACS
> > > redirect is enabled a host IOVA could be directed to another device on
> > > the switch instead. To prevent that we need to reserve those addresses
> > > to avoid address space collisions.
>
> Our test case is just to perform DMA to/from the host on one device on
> a switch and what we are seeing is that when we hit an IOVA that
> matches up with the physical address of the neighboring devices BAR0
> then we are seeing an AER followed by a hot reset.

ACS is always confusing... Does your NIC have a DTLB?

If request redirect is set, and the Egress is enabled, then all
transactions should go upstream to the root-port->IOMMU before being
served.

In my 6.0 spec it's in 6.12.3, ACS Peer-to-Peer Control Interactions.

And maybe lspci would show how things are setup in the switch?
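
For example, something like this on each downstream port of the switch
(the output below is only illustrative, not from a real system):

# lspci -s <downstream port BDF> -vvv | grep -A2 'Access Control'
	Capabilities: [xxx v1] Access Control Services
		ACSCap:	SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl+ DirectTrans+
		ACSCtl:	SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

ACSCtl is the line that shows what is actually enabled (ReqRedir in
particular for this case).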

>
> > Any untranslated address from a device must be forwarded to the IOMMU when
> > ACS is enabled correct? I guess if you want true p2p, then you would need
> > to map so that the hpa turns into the peer address.. but its always a round
> > trip to IOMMU.
>
> This assumes all parts are doing the Request Redirect "correctly". In
> our case there is a PCIe switch we are trying to debug and we have a
> few working theories. One concern I have is that the switch may be
> throwing an ACS violation for us using an address that matches a
> neighboring device instead of redirecting it to the upstream port. If
> we pull the switch and just run on the root complex the issue seems to
> be resolved so I started poking into the code which led me to the
> documentation pointing out what is supposed to be reserved based on
> the root complex and MSI regions.
>
> As a part of going down that rabbit hole I realized that the
> reserved_regions seems to only list the MSI reservation. However after
> digging a bit deeper it seems like there is code to reserve the memory
> behind the root complex in the IOVA but it doesn't look like that is
> visible anywhere and is the piece I am currently trying to sort out.
> What I am working on is trying to figure out if the system that is
> failing is actually reserving that memory region in the IOVA, or if
> that is somehow not happening in our test setup.

I suspect with the IOMMU there is no need to punch holes like we do for
the MSI region. In very early IOMMU code I vaguely recall we did that,
but our knowledge of ACS was weak. (Not that it has improved :-))

Knowing how the switch and root ports are set up for forwarding may
provide some clues. The easy option may be to forcibly add the range to
the reserved regions and see whether the ACS violation goes away.

Baolu might have some better ideas.

--
Cheers,
Ashok

Bike Shedding: (a.k.a Parkinson's Law of Triviality)
- When the discussion on a topic is inversely proportionate to the gravity of
the topic.

2023-06-08 18:45:18

by Robin Murphy

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On 2023-06-08 18:10, Alexander Duyck wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <[email protected]> wrote:
>>
>> On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
>>> On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <[email protected]> wrote:
>>>>
>>>> On 6/8/23 7:03 AM, Alexander Duyck wrote:
>>>>> On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> I am running into a DMA issue that appears to be a conflict between
>>>>>> ACS and IOMMU. As per the documentation I can find, the IOMMU is
>>>>>> supposed to create reserved regions for MSI and the memory window
>>>>>> behind the root port. However looking at reserved_regions I am not
>>>>>> seeing that. I only see the reservation for the MSI.
>>>>>>
>>>>>> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
>>>>>> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
>>>>>> 0x00000000fee00000 0x00000000feefffff msi
>>>>>>
>>>>>> Shouldn't there also be a memory window for the region behind the root
>>>>>> port to prevent any possible peer-to-peer access?
>>>>>
>>>>> Since the iommu portion of the email bounced I figured I would fix
>>>>> that and provide some additional info.
>>>>>
>>>>> I added some instrumentation to the kernel to dump the resources found
>>>>> in iova_reserve_pci_windows. From what I can tell it is finding the
>>>>> correct resources for the Memory and Prefetchable regions behind the
>>>>> root port. It seems to be calling reserve_iova which is successfully
>>>>> allocating an iova to reserve the region.
>>>>>
>>>>> However still no luck on why it isn't showing up in reserved_regions.
>>>>
>>>> Perhaps I can ask the opposite question, why it should show up in
>>>> reserve_regions? Why does the iommu subsystem block any possible peer-
>>>> to-peer DMA access? Isn't that a decision of the device driver.
>>>>
>>>> The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
>>>> which is not related to peer-to-peer accesses.
>>>
>>> The problem is if the IOVA overlaps with the physical addresses of
>>> other devices that can be routed to via ACS redirect. As such if ACS
>>> redirect is enabled a host IOVA could be directed to another device on
>>> the switch instead. To prevent that we need to reserve those addresses
>>> to avoid address space collisions.
>
> Our test case is just to perform DMA to/from the host on one device on
> a switch and what we are seeing is that when we hit an IOVA that
> matches up with the physical address of the neighboring devices BAR0
> then we are seeing an AER followed by a hot reset.
>
>> Any untranslated address from a device must be forwarded to the IOMMU when
>> ACS is enabled correct? I guess if you want true p2p, then you would need
>> to map so that the hpa turns into the peer address.. but its always a round
>> trip to IOMMU.
>
> This assumes all parts are doing the Request Redirect "correctly". In
> our case there is a PCIe switch we are trying to debug and we have a
> few working theories. One concern I have is that the switch may be
> throwing an ACS violation for us using an address that matches a
> neighboring device instead of redirecting it to the upstream port. If
> we pull the switch and just run on the root complex the issue seems to
> be resolved so I started poking into the code which led me to the
> documentation pointing out what is supposed to be reserved based on
> the root complex and MSI regions.
>
> As a part of going down that rabbit hole I realized that the
> reserved_regions seems to only list the MSI reservation. However after
> digging a bit deeper it seems like there is code to reserve the memory
> behind the root complex in the IOVA but it doesn't look like that is
> visible anywhere and is the piece I am currently trying to sort out.
> What I am working on is trying to figure out if the system that is
> failing is actually reserving that memory region in the IOVA, or if
> that is somehow not happening in our test setup.

How old's the kernel? Before 5.11, intel-iommu wasn't hooked up to
iommu-dma so didn't do quite the same thing - it only reserved whatever
specific PCI memory resources existed at boot, rather than the whole
window as iommu-dma does. Either way, ftrace on reserve_iova() (or just
whack a print in there) should suffice to see what's happened.
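
E.g. something like the following, assuming tracefs/debugfs is mounted
(note the reservation happens when the device's DMA domain is first set
up, so it may need to be caught around driver probe time):

# cd /sys/kernel/debug/tracing
# echo reserve_iova > set_ftrace_filter
# echo function > current_tracer
# cat trace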

Robin.

2023-06-08 18:47:14

by Alexander H Duyck

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Thu, Jun 8, 2023 at 10:52 AM Ashok Raj <[email protected]> wrote:
>
> On Thu, Jun 08, 2023 at 10:10:54AM -0700, Alexander Duyck wrote:
> > On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <[email protected]> wrote:
> > >
> > > On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> > > > On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <[email protected]> wrote:
> > > > >
> > > > > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > > > > > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > > > > > <[email protected]> wrote:
> > > > > >>
> > > > > >> I am running into a DMA issue that appears to be a conflict between
> > > > > >> ACS and IOMMU. As per the documentation I can find, the IOMMU is
> > > > > >> supposed to create reserved regions for MSI and the memory window
> > > > > >> behind the root port. However looking at reserved_regions I am not
> > > > > >> seeing that. I only see the reservation for the MSI.
> > > > > >>
> > > > > >> So for example with an enabled NIC and iommu enabled w/o passthru I am seeing:
> > > > > >> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> > > > > >> 0x00000000fee00000 0x00000000feefffff msi
> > > > > >>
> > > > > >> Shouldn't there also be a memory window for the region behind the root
> > > > > >> port to prevent any possible peer-to-peer access?
> > > > > >
> > > > > > Since the iommu portion of the email bounced I figured I would fix
> > > > > > that and provide some additional info.
> > > > > >
> > > > > > I added some instrumentation to the kernel to dump the resources found
> > > > > > in iova_reserve_pci_windows. From what I can tell it is finding the
> > > > > > correct resources for the Memory and Prefetchable regions behind the
> > > > > > root port. It seems to be calling reserve_iova which is successfully
> > > > > > allocating an iova to reserve the region.
> > > > > >
> > > > > > However still no luck on why it isn't showing up in reserved_regions.
> > > > >
> > > > > Perhaps I can ask the opposite question, why it should show up in
> > > > > reserve_regions? Why does the iommu subsystem block any possible peer-
> > > > > to-peer DMA access? Isn't that a decision of the device driver.
> > > > >
> > > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > > > which is not related to peer-to-peer accesses.
> > > >
> > > > The problem is if the IOVA overlaps with the physical addresses of
> > > > other devices that can be routed to via ACS redirect. As such if ACS
> > > > redirect is enabled a host IOVA could be directed to another device on
> > > > the switch instead. To prevent that we need to reserve those addresses
> > > > to avoid address space collisions.
> >
> > Our test case is just to perform DMA to/from the host on one device on
> > a switch and what we are seeing is that when we hit an IOVA that
> > matches up with the physical address of the neighboring devices BAR0
> > then we are seeing an AER followed by a hot reset.
>
> ACS is always confusing.. Does your NIC have a DTLB?

No. It is using the IOMMU for all address translation. I am also
pushing back on the test being used. It is always possible they have
implemented something incorrectly and are overrunning a buffer into the
reserved IOVA region, and the overlap is just a coincidence.

> If request redirect is set, and the Egress is enabled, then all
> transactions should go upstream to the root-port->IOMMU before being
> served.
>
> In my 6.0 spec its in 6.12.3 ACS Peer-to-Peer Control Interactions?
>
> And maybe lspci would show how things are setup in the switch?

We were setting the Request Redirect only, no Egress. I agree, based
on the config everything should just go upstream. However, if we
eliminate the switch or put things in passthrough mode the problem
goes away.

> >
> > > Any untranslated address from a device must be forwarded to the IOMMU when
> > > ACS is enabled correct? I guess if you want true p2p, then you would need
> > > to map so that the hpa turns into the peer address.. but its always a round
> > > trip to IOMMU.
> >
> > This assumes all parts are doing the Request Redirect "correctly". In
> > our case there is a PCIe switch we are trying to debug and we have a
> > few working theories. One concern I have is that the switch may be
> > throwing an ACS violation for us using an address that matches a
> > neighboring device instead of redirecting it to the upstream port. If
> > we pull the switch and just run on the root complex the issue seems to
> > be resolved so I started poking into the code which led me to the
> > documentation pointing out what is supposed to be reserved based on
> > the root complex and MSI regions.
> >
> > As a part of going down that rabbit hole I realized that the
> > reserved_regions seems to only list the MSI reservation. However after
> > digging a bit deeper it seems like there is code to reserve the memory
> > behind the root complex in the IOVA but it doesn't look like that is
> > visible anywhere and is the piece I am currently trying to sort out.
> > What I am working on is trying to figure out if the system that is
> > failing is actually reserving that memory region in the IOVA, or if
> > that is somehow not happening in our test setup.
>
> I suspect with IOMMU, there is no need to pluck holes like we do for the
> MSI. In very early code in IOMMU i vaguely recall we did that, but our
> knowledge on ACS was weak. (not that has improved :-)).

The hole has to do mostly with avoiding any possibility of misrouting
things, or at least that was my understanding after reading it.

> Knowing how the switch and root ports are setup with forwarding may help
> with some clues. The easy option is maybe forcibly adding to the reserved
> range may help to see if you don't see the ACS violation.
>
> Baolu might have some better ideas.

I'm working with the team that is hitting the issue to verify that now.
In theory the region should already be reserved, so we are checking
whether that is actually the case.

Thanks,

- Alex

2023-06-08 18:52:02

by Alexander H Duyck

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Thu, Jun 8, 2023 at 11:02 AM Robin Murphy <[email protected]> wrote:
>
> On 2023-06-08 18:10, Alexander Duyck wrote:

<...>

> > As a part of going down that rabbit hole I realized that the
> > reserved_regions seems to only list the MSI reservation. However after
> > digging a bit deeper it seems like there is code to reserve the memory
> > behind the root complex in the IOVA but it doesn't look like that is
> > visible anywhere and is the piece I am currently trying to sort out.
> > What I am working on is trying to figure out if the system that is
> > failing is actually reserving that memory region in the IOVA, or if
> > that is somehow not happening in our test setup.
>
> How old's the kernel? Before 5.11, intel-iommu wasn't hooked up to
> iommu-dma so didn't do quite the same thing - it only reserved whatever
> specific PCI memory resources existed at boot, rather than the whole
> window as iommu-dma does. Either way, ftrace on reserve_iova() (or just
> whack a print in there) should suffice to see what's happened.
>
> Robin.

We are working with a 5.12 kernel. I will do some digging. We may be
able to backport some fixes if needed.

Thanks,

- Alex

2023-06-13 16:23:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:

> > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > which is not related to peer-to-peer accesses.
>
> Right, in general the IOMMU driver cannot be held responsible for whatever
> might happen upstream of the IOMMU input.

The driver yes, but..

> The DMA layer carves PCI windows out of its IOVA space
> unconditionally because we know that they *might* be problematic,
> and we don't have any specific constraints on our IOVA layout so
> it's no big deal to just sacrifice some space for simplicity.

This is a problem for everything using UNMANAGED domains. If the iommu
API user picks an IOVA it should be able to expect it to work. If the
interconnect fails to allow it to work then this has to be discovered,
otherwise UNMANAGED domains are not usable at all.

Eg vfio and iommufd are also in trouble on these configurations.

We shouldn't expect every iommu user to fix this entirely on their
own.

> We don't want to have to go digging any further into bus-specific
> code to reason about whether the right ACS capabilities are present
> and enabled everywhere to prevent direct P2P or not. Other use-cases
> may have different requirements, though, so it's up to them what
> they want to do.

I agree the dma-iommu stuff doesn't have to be as precise as other
places might want (but it also wouldn't be harmed by being more
precise).

But I can't think of any place that can just ignore this and still be
correct...

So, I think it makes sense that the iommu driver not be involved, but
IMHO the core code should have APIs to report IOVA that doesn't work,
and every user of UNMANAGED domains needs to check it.

IOW it should probably come out of the existing reserved regions
interface.

Jason

2023-06-16 08:52:11

by Tian, Kevin

[permalink] [raw]
Subject: RE: Question about reserved_regions w/ Intel IOMMU

+Alex

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, June 13, 2023 11:54 PM
>
> On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:
>
> > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > which is not related to peer-to-peer accesses.
> >
> > Right, in general the IOMMU driver cannot be held responsible for
> whatever
> > might happen upstream of the IOMMU input.
>
> The driver yes, but..
>
> > The DMA layer carves PCI windows out of its IOVA space
> > unconditionally because we know that they *might* be problematic,
> > and we don't have any specific constraints on our IOVA layout so
> > it's no big deal to just sacrifice some space for simplicity.
>
> This is a problem for everything using UNMANAGED domains. If the iommu
> API user picks an IOVA it should be able to expect it to work. If the
> intereconnect fails to allow it to work then this has to be discovered
> otherwise UNAMANGED domains are not usable at all.
>
> Eg vfio and iommufd are also in trouble on these configurations.
>

If those PCI windows are problematic, e.g. due to ACS, the devices
belong to a single iommu group. If a vfio user opens all the devices in
that group then it can discover and reserve those windows in its IOVA
space. The problem is that the user may not open all the devices, and
then there is currently no way for it to know the windows on those
unopened devices.

Curious why nobody complained about this gap before this thread...

2023-06-16 12:41:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote:
> +Alex
>
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, June 13, 2023 11:54 PM
> >
> > On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:
> >
> > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > > which is not related to peer-to-peer accesses.
> > >
> > > Right, in general the IOMMU driver cannot be held responsible for
> > whatever
> > > might happen upstream of the IOMMU input.
> >
> > The driver yes, but..
> >
> > > The DMA layer carves PCI windows out of its IOVA space
> > > unconditionally because we know that they *might* be problematic,
> > > and we don't have any specific constraints on our IOVA layout so
> > > it's no big deal to just sacrifice some space for simplicity.
> >
> > This is a problem for everything using UNMANAGED domains. If the iommu
> > API user picks an IOVA it should be able to expect it to work. If the
> > intereconnect fails to allow it to work then this has to be discovered
> > otherwise UNAMANGED domains are not usable at all.
> >
> > Eg vfio and iommufd are also in trouble on these configurations.
> >
>
> If those PCI windows are problematic e.g. due to ACS they belong to
> a single iommu group. If a vfio user opens all the devices in that group
> then it can discover and reserve those windows in its IOVA space.

How? We don't even exclude the single device's BAR if there is no ACS?

> The problem is that the user may not open all the devices then
> currently there is no way for it to know the windows on those
> unopened devices.
>
> Curious why nobody complains about this gap before this thread...

Probably because it only matters if you have a real PCIe switch in the
system, which is pretty rare.

Jason

2023-06-16 15:38:12

by Alexander H Duyck

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Fri, Jun 16, 2023 at 5:20 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote:
> > +Alex
> >
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Tuesday, June 13, 2023 11:54 PM
> > >
> > > On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:
> > >
> > > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
> > > > > which is not related to peer-to-peer accesses.
> > > >
> > > > Right, in general the IOMMU driver cannot be held responsible for
> > > whatever
> > > > might happen upstream of the IOMMU input.
> > >
> > > The driver yes, but..
> > >
> > > > The DMA layer carves PCI windows out of its IOVA space
> > > > unconditionally because we know that they *might* be problematic,
> > > > and we don't have any specific constraints on our IOVA layout so
> > > > it's no big deal to just sacrifice some space for simplicity.
> > >
> > > This is a problem for everything using UNMANAGED domains. If the iommu
> > > API user picks an IOVA it should be able to expect it to work. If the
> > > intereconnect fails to allow it to work then this has to be discovered
> > > otherwise UNAMANGED domains are not usable at all.
> > >
> > > Eg vfio and iommufd are also in trouble on these configurations.
> > >
> >
> > If those PCI windows are problematic e.g. due to ACS they belong to
> > a single iommu group. If a vfio user opens all the devices in that group
> > then it can discover and reserve those windows in its IOVA space.
>
> How? We don't even exclude the single device's BAR if there is no ACS?

The issue here was a defective ACS on a PCIe switch.

> > The problem is that the user may not open all the devices then
> > currently there is no way for it to know the windows on those
> > unopened devices.
> >
> > Curious why nobody complains about this gap before this thread...
>
> Probably because it only matters if you have a real PCIe switch in the
> system, which is pretty rare.

So just FYI, I am pretty sure we have a partitioned PCIe switch that
has FW issues. Specifically, it doesn't seem to be honoring the
Request Redirect bit, so requests that are supposed to be going to the
root complex/IOMMU are getting redirected to an NVMe device on the same
physical PCIe switch. We are in the process of getting that sorted out
now and are using the forcedac option in the meantime to keep the IOMMU
out of the 32-bit address space that was causing the issue.

The reason for my original request is more about the user experience
of trying to figure out what is reserved and what isn't. It seems like
the IOVA will have reservations that are not visible to the end user.
So when I go looking through reserved_regions in sysfs it just lists
the MSI regions that are reserved, and maybe some regions such as the
memory for USB, while in reality we may be reserving IOVA regions in
iova_reserve_pci_windows() that will not be exposed without adding
probes or some printk debugging.

2023-06-16 16:51:38

by Robin Murphy

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On 2023-06-16 16:27, Alexander Duyck wrote:
> On Fri, Jun 16, 2023 at 5:20 AM Jason Gunthorpe <[email protected]> wrote:
>>
>> On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote:
>>> +Alex
>>>
>>>> From: Jason Gunthorpe <[email protected]>
>>>> Sent: Tuesday, June 13, 2023 11:54 PM
>>>>
>>>> On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:
>>>>
>>>>>> The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
>>>>>> which is not related to peer-to-peer accesses.
>>>>>
>>>>> Right, in general the IOMMU driver cannot be held responsible for
>>>> whatever
>>>>> might happen upstream of the IOMMU input.
>>>>
>>>> The driver yes, but..
>>>>
>>>>> The DMA layer carves PCI windows out of its IOVA space
>>>>> unconditionally because we know that they *might* be problematic,
>>>>> and we don't have any specific constraints on our IOVA layout so
>>>>> it's no big deal to just sacrifice some space for simplicity.
>>>>
>>>> This is a problem for everything using UNMANAGED domains. If the iommu
>>>> API user picks an IOVA it should be able to expect it to work. If the
>>>> intereconnect fails to allow it to work then this has to be discovered
>>>> otherwise UNAMANGED domains are not usable at all.
>>>>
>>>> Eg vfio and iommufd are also in trouble on these configurations.
>>>>
>>>
>>> If those PCI windows are problematic e.g. due to ACS they belong to
>>> a single iommu group. If a vfio user opens all the devices in that group
>>> then it can discover and reserve those windows in its IOVA space.
>>
>> How? We don't even exclude the single device's BAR if there is no ACS?
>
> The issue here was a defective ACS on a PCIe switch.
>
>>> The problem is that the user may not open all the devices then
>>> currently there is no way for it to know the windows on those
>>> unopened devices.
>>>
>>> Curious why nobody complains about this gap before this thread...
>>
>> Probably because it only matters if you have a real PCIe switch in the
>> system, which is pretty rare.
>
> So just FYI I am pretty sure we have a partitioned PCIe switch that
> has FW issues. Specifically it doesn't seem to be honoring the
> Redirect Request bit so what is happening is that we are seeing
> requests that are supposed to be going to the root complex/IOMMU
> getting redirected to an NVMe device that was on the same physical
> PCIe switch. We are in the process of getting that sorted out now and
> are using the forcedac option in the meantime to keep the IOMMU out of
> the 32b address space that was causing the issue.
>
> The reason for my original request is more about the user experience
> of trying to figure out what is reserved and what isn't. It seems like
> the IOVA will have reservations that are not visible to the end user.
> So when I go looking through the reserved_regions in sysfs it just
> lists the MSI regions that are reserved, and maybe some regions such
> as the memory for USB. while in reality we may be reserving IOVA
> regions in iova_reserve_pci_windows that will not be exposed without
> having to add probes or adding some printk debugging.

lspci -vvv seems to have no problem telling me about what PCI memory
space is assigned where, even as an unprivileged user, so surely it's
available to any VFIO user too?

It is not necessarily useful for the IOMMU layer to claim to userspace
that an entire window is unusable if in fact there's nothing in there
that would be treated as a P2P address, so it's actually fine. As I say,
iommu-dma can make that assumption for itself because iommu-dma doesn't
need to maintain any particular address space layout, but it could be
overly restrictive for a userspace process or VMM which does.

If the system has working ACS configured correctly, then this issue
should be moot; if it doesn't, then a VFIO user is going to get a whole
group of peer devices if they're getting anything at all, so it doesn't
seem entirely unreasonable to leave it up to them to check that all
those devices' resources play well with their expected memory map. And
the particular case of a system which claims to have working ACS but
doesn't, doesn't really seem to be something that can or should be
worked around from userspace; if that switch can't be fixed, it probably
wants an ACS quirk adding in the kernel.

Thanks,
Robin.

2023-06-16 19:02:08

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Fri, Jun 16, 2023 at 08:27:21AM -0700, Alexander Duyck wrote:

> > > The problem is that the user may not open all the devices then
> > > currently there is no way for it to know the windows on those
> > > unopened devices.
> > >
> > > Curious why nobody complains about this gap before this thread...
> >
> > Probably because it only matters if you have a real PCIe switch in the
> > system, which is pretty rare.
>
> So just FYI I am pretty sure we have a partitioned PCIe switch that
> has FW issues.

Yeah, that is pretty common :(

But I think you've touched on a gap in the API.

Jason

2023-06-16 19:17:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Fri, Jun 16, 2023 at 05:34:53PM +0100, Robin Murphy wrote:
>
> If the system has working ACS configured correctly, then this issue should
> be moot;

Yes

> if it doesn't, then a VFIO user is going to get a whole group of
> peer devices if they're getting anything at all, so it doesn't seem entirely
> unreasonable to leave it up to them to check that all those devices'
> resources play well with their expected memory map.

I think the kernel should be helping here... 'go figure it out from
lspci' is a very convoluted and obscure uAPI, and I don't see things
like DPDK actually doing that.

IMHO the uAPI expectation is that the kernel informs userspace what
the usable IOVA is; if bridge windows and lack of ACS are rendering
address space unusable then VFIO/iommufd should return it as excluded
as well.

If we are going to do that then all UNMANAGED domain users should
follow the same logic.

We probably have avoided bug reports because of how rare it would be
to see a switch together with a scenario using an UNMANAGED domain -
especially with ACS turned off.

So it is a really narrow niche... Obscure enough that I'm not going to
make patches :)

Jason

2023-06-19 10:43:12

by Robin Murphy

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On 2023-06-16 19:59, Jason Gunthorpe wrote:
> On Fri, Jun 16, 2023 at 05:34:53PM +0100, Robin Murphy wrote:
>>
>> If the system has working ACS configured correctly, then this issue should
>> be moot;
>
> Yes
>
>> if it doesn't, then a VFIO user is going to get a whole group of
>> peer devices if they're getting anything at all, so it doesn't seem entirely
>> unreasonable to leave it up to them to check that all those devices'
>> resources play well with their expected memory map.
>
> I think the kernel should be helping here.. 'go figure it out from
> lspci' is a very convoluted and obscure uAPI, and I don't see things
> like DPDK actually doing that.
>
> IMHO the uAPI expectation is that the kernel informs userspace what
> the usable IOVA is, if bridge windows and lack of ACS are rendering
> address space unusable then VFIO/iommufd should return it as excluded
> as well.
>
> If we are going to do that then all UNAMANGED domain users should
> follow the same logic.
>
> We probably have avoided bug reports because of how rare it would be
> to see a switch and an UNMANAGED domain using scenario together -
> especially with ACS turned off.
>
> So it is really narrow niche.. Obscure enough I'm not going to make
> patches :)

The main thing is that we've already been round this once before; we
tried it 6 years ago and then reverted it a year later for causing more
problems than it solved:

https://lkml.org/lkml/2018/3/2/760

Thanks,
Robin.

2023-06-19 14:58:42

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Mon, Jun 19, 2023 at 11:20:58AM +0100, Robin Murphy wrote:
> On 2023-06-16 19:59, Jason Gunthorpe wrote:
> > On Fri, Jun 16, 2023 at 05:34:53PM +0100, Robin Murphy wrote:
> > >
> > > If the system has working ACS configured correctly, then this issue should
> > > be moot;
> >
> > Yes
> >
> > > if it doesn't, then a VFIO user is going to get a whole group of
> > > peer devices if they're getting anything at all, so it doesn't seem entirely
> > > unreasonable to leave it up to them to check that all those devices'
> > > resources play well with their expected memory map.
> >
> > I think the kernel should be helping here.. 'go figure it out from
> > lspci' is a very convoluted and obscure uAPI, and I don't see things
> > like DPDK actually doing that.
> >
> > IMHO the uAPI expectation is that the kernel informs userspace what
> > the usable IOVA is, if bridge windows and lack of ACS are rendering
> > address space unusable then VFIO/iommufd should return it as excluded
> > as well.
> >
> > If we are going to do that then all UNAMANGED domain users should
> > follow the same logic.
> >
> > We probably have avoided bug reports because of how rare it would be
> > to see a switch and an UNMANAGED domain using scenario together -
> > especially with ACS turned off.
> >
> > So it is really narrow niche.. Obscure enough I'm not going to make
> > patches :)
>
> The main thing is that we've already been round this once before; we tried
> it 6 years ago and then reverted it a year later for causing more problems
> than it solved:

As I said earlier in this thread, if we do it for VFIO then the
calculation must be precise and consider bus details like ACS etc.,
e.g. VFIO on an ACS system should not report any new regions.

It looks like that thread confirms we can't create reserved regions
which are wrong :)

I think Alex is saying the same things I'm saying in that thread too:

https://lore.kernel.org/all/[email protected]/

(b) is what the kernel should help prevent.

And it is clear there are scenarios today where a VFIO user will get
data loss because the valid IOVA reported by the kernel is incorrect.
Fixing this is hard, much harder than what commit 273df9635385
("iommu/dma: Make PCI window reservation generic") did.

Thanks,
Jason

2023-06-20 15:15:30

by Alexander H Duyck

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Mon, Jun 19, 2023 at 7:02 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Jun 19, 2023 at 11:20:58AM +0100, Robin Murphy wrote:
> > On 2023-06-16 19:59, Jason Gunthorpe wrote:
> > > On Fri, Jun 16, 2023 at 05:34:53PM +0100, Robin Murphy wrote:
> > > >
> > > > If the system has working ACS configured correctly, then this issue should
> > > > be moot;
> > >
> > > Yes
> > >
> > > > if it doesn't, then a VFIO user is going to get a whole group of
> > > > peer devices if they're getting anything at all, so it doesn't seem entirely
> > > > unreasonable to leave it up to them to check that all those devices'
> > > > resources play well with their expected memory map.
> > >
> > > I think the kernel should be helping here.. 'go figure it out from
> > > lspci' is a very convoluted and obscure uAPI, and I don't see things
> > > like DPDK actually doing that.
> > >
> > > IMHO the uAPI expectation is that the kernel informs userspace what
> > > the usable IOVA is, if bridge windows and lack of ACS are rendering
> > > address space unusable then VFIO/iommufd should return it as excluded
> > > as well.
> > >
> > > If we are going to do that then all UNAMANGED domain users should
> > > follow the same logic.
> > >
> > > We probably have avoided bug reports because of how rare it would be
> > > to see a switch and an UNMANAGED domain using scenario together -
> > > especially with ACS turned off.
> > >
> > > So it is really narrow niche.. Obscure enough I'm not going to make
> > > patches :)
> >
> > The main thing is that we've already been round this once before; we tried
> > it 6 years ago and then reverted it a year later for causing more problems
> > than it solved:
>
> As I said earlier in this thread if we do it for VFIO then the
> calculation must be precise and consider bus details like
> ACS/etc. eg VFIO on an ACS system should not report any new regions.
>
> It looks like that thread confirms we can't create reserved regions
> which are wrong :)
>
> I think Alex is saying the same things I'm saying in that thread too:
>
> https://lore.kernel.org/all/[email protected]/
>
> (b) is what the kernel should help prevent.
>
> And it is clear there are today scenarios where a VFIO user will get
> data loss because the reported valid IOVA from the kernel is
> incorrect. Fixing this is hard, much harder than what commit
> 273df9635385 ("iommu/dma: Make PCI window reservation generic") has.

I think this may have gone off down a rathole as my original question
wasn't anything about adding extra reserved regions. It was about
exposing what the IOVA is already reserving so it could be user
visible.

The issue was that the reservation(s) didn't appear in the
reserved_regions sysfs file, and it required adding probes or printk
debugging in order to figure out what is reserved and what is not.
Specifically what I was trying to point out is that there are regions
reserved in iova_reserve_pci_windows() that are not user/admin
visible. The function reserve_iova doesn't do anything to track the
reservation so that it can be recalled later for display. It made
things harder to debug, as I wasn't sure whether the addresses I was
seeing were valid for the IOMMU or not, since I didn't know if they
were supposed to be reserved, and the documentation I had found implied
they were.

2023-06-20 17:28:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Tue, Jun 20, 2023 at 07:57:57AM -0700, Alexander Duyck wrote:

> I think this may have gone off down a rathole as my original question
> wasn't anything about adding extra reserved regions. It was about
> exposing what the IOVA is already reserving so it could be user
> visible.

Your question points out that dma-iommu.c uses a different set of
reserved regions than everything else, and its set is closer to
functionally correct.

IMHO the resolution to what you are talking about is not to add more
debugging to dma-iommu but to make the set of reserved regions
consistently correct for everyone, which will make them viewable in
sysfs.

Jason

2023-06-20 18:21:46

by Alexander H Duyck

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On Tue, Jun 20, 2023 at 9:55 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Jun 20, 2023 at 07:57:57AM -0700, Alexander Duyck wrote:
>
> > I think this may have gone off down a rathole as my original question
> > wasn't anything about adding extra reserved regions. It was about
> > exposing what the IOVA is already reserving so it could be user
> > visible.
>
> Your question points out that dma-iommu.c uses a different set of
> reserved regions than everything else, and its set is closer to
> functionally correct.
>
> IMHO the resolution to what you are talking about is not to add more
> debugging to dma-iommu but to make the set of reserved regions
> consistently correct for everyone, which will make them viewable in
> sysfs.

Okay, that makes sense to me, and I agree. If we had a consistent set
of reserved regions then it would be easier to understand. If nothing
else, my request would be to expose the IOVA reserved regions, and then
most likely the other ones could be deprecated since they all seem to
be consolidated in the IOVA anyway.

Thanks,

- Alex

2023-06-21 08:40:13

by Tian, Kevin

[permalink] [raw]
Subject: RE: Question about reserved_regions w/ Intel IOMMU

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, June 16, 2023 8:21 PM
>
> On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote:
> > +Alex
> >
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Tuesday, June 13, 2023 11:54 PM
> > >
> > > On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:
> > >
> > > > > The iova_reserve_pci_windows() you've seen is for kernel DMA
> interfaces
> > > > > which is not related to peer-to-peer accesses.
> > > >
> > > > Right, in general the IOMMU driver cannot be held responsible for
> > > whatever
> > > > might happen upstream of the IOMMU input.
> > >
> > > The driver yes, but..
> > >
> > > > The DMA layer carves PCI windows out of its IOVA space
> > > > unconditionally because we know that they *might* be problematic,
> > > > and we don't have any specific constraints on our IOVA layout so
> > > > it's no big deal to just sacrifice some space for simplicity.
> > >
> > > This is a problem for everything using UNMANAGED domains. If the
> iommu
> > > API user picks an IOVA it should be able to expect it to work. If the
> > > intereconnect fails to allow it to work then this has to be discovered
> > > otherwise UNAMANGED domains are not usable at all.
> > >
> > > Eg vfio and iommufd are also in trouble on these configurations.
> > >
> >
> > If those PCI windows are problematic e.g. due to ACS they belong to
> > a single iommu group. If a vfio user opens all the devices in that group
> > then it can discover and reserve those windows in its IOVA space.
>
> How? We don't even exclude the single device's BAR if there is no ACS?

I thought the initial vBAR value in vfio is copied from the physical
BAR, so the user could check this value and skip it. But it's informal,
and it looks like today QEMU doesn't compose the GPA layout with any
information from there.

>
> > The problem is that the user may not open all the devices then
> > currently there is no way for it to know the windows on those
> > unopened devices.
> >
> > Curious why nobody complains about this gap before this thread...
>
> Probably because it only matters if you have a real PCIe switch in the
> system, which is pretty rare.
>

A multi-device group might not be rare, given vfio has spent so much
effort to manage them.

More likely the virtual BIOS may reserve a big enough hole between
[3GB, 4GB] which happens to cover the physical BARs (if not 64-bit) in
the group, avoiding the conflict, e.g.:

c0000000-febfffff : PCI Bus 0000:00
  fd000000-fdffffff : 0000:00:01.0
    fd000000-fdffffff : bochs-drm
  fe000000-fe01ffff : 0000:00:02.0
  fe020000-fe02ffff : 0000:00:02.0
  fe030000-fe033fff : 0000:00:03.0
    fe030000-fe033fff : virtio-pci-modern
  feb80000-febbffff : 0000:00:03.0
  febd0000-febd0fff : 0000:00:01.0
    febd0000-febd0fff : bochs-drm
  febd1000-febd1fff : 0000:00:03.0
  febd2000-febd2fff : 0000:00:1f.2
    febd2000-febd2fff : ahci
fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
  fed00000-fed003ff : PNP0103:00
fed1c000-fed1ffff : Reserved
  fed1f410-fed1f414 : iTCO_wdt.0.auto
fed90000-fed90fff : dmar0
fee00000-fee00fff : Local APIC
feffc000-feffffff : Reserved
fffc0000-ffffffff : Reserved

2023-06-21 12:22:52

by Robin Murphy

[permalink] [raw]
Subject: Re: Question about reserved_regions w/ Intel IOMMU

On 2023-06-20 18:47, Alexander Duyck wrote:
> On Tue, Jun 20, 2023 at 9:55 AM Jason Gunthorpe <[email protected]> wrote:
>>
>> On Tue, Jun 20, 2023 at 07:57:57AM -0700, Alexander Duyck wrote:
>>
>>> I think this may have gone off down a rathole as my original question
>>> wasn't anything about adding extra reserved regions. It was about
>>> exposing what the IOVA is already reserving so it could be user
>>> visible.
>>
>> Your question points out that dma-iommu.c uses a different set of
>> reserved regions than everything else, and its set is closer to
>> functionally correct.
>>
>> IMHO the resolution to what you are talking about is not to add more
>> debugging to dma-iommu but to make the set of reserved regions
>> consistently correct for everyone, which will make them viewable in
>> sysfs.
>
> Okay, that makes sense to me, and I agree. If we had a consistent set
> of reserved regions then it would make it easier to understand.

It would also be wrong, unfortunately, because it's conflating multiple
different things (there are overlapping notions of "reserve" at play
here...). IOMMU API reserved regions are specific things that the IOMMU
driver knows are special and all IOMMU domain users definitely need to
be aware of. iommu-dma is merely one of those users; it is another layer
on top of the API which manages its own IOVA space how it sees fit, just
like VFIO or other IOMMU-aware drivers. It honours those reserved
regions (via iommu_group_create_direct_mappings()), but it also carves
out plenty of IOVA space which is probably perfectly usable - some of
which is related to possible upstream bus constraints, to save the
hassle of checking; some purely for its own convenience, like the page
at IOVA 0 - but it still *doesn't* carve out more IOVA regions which are
also unusable overall due to other upstream bus or endpoint constraints,
since those are handled dynamically in its allocator instead (dma_mask,
bus_dma_limit etc.)

> If
> nothing else my request would be to expose the iova reserved regions
> and then most likely the other ones could be deprecated since they
> seem to all be consolidated in the IOVA anyway.

FWIW there's no upstream provision for debugging iommu-dma from
userspace since it's not something that anyone other than me has ever
had any apparent need to do, and you can get an idea of how long it's
been since even I thought about that from when I seem to have given up
rebasing my local patches for it[1] :)

Thanks,
Robin.

[1] https://gitlab.arm.com/linux-arm/linux-rm/-/commits/iommu/misc/