2021-06-07 03:00:48

by Tian, Kevin

Subject: Plan for /dev/ioasid RFC v2

Hi, all,

We plan to work on v2 now, given many good comments already received
and substantial changes envisioned. This is a very complex topic with
many sub-threads being discussed. To ensure that I didn't miss valuable
suggestions (and also keep everyone on the same page), here I'd like to
provide a list of planned changes in my mind. Please let me know if
anything important is lost. :)

--

(Remaining opens in v1)

- Protocol between kvm/vfio/ioasid for wbinvd/no-snoop. I'll see how
much can be refined based on discussion progress when v2 is out;

- Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
convinced yet. Based on the discussion, v2 will keep the ioasid uAPI
device-centric (but it's fine for vfio to be group-centric). A new
section will be added to elaborate on this part;

- PASID virtualization (section 4) has not been thoroughly discussed yet.
Jason gave some suggestions on how to categorize intended usages.
I will rephrase this section and hope more discussion can be held on
it in v2;

(Adopted suggestions)

- (Jason) Rename /dev/ioasid to /dev/iommu (and the uAPI likewise, e.g.
IOASID_XXX to IOMMU_XXX). One suggestion (Jason) was to also rename
RID+PASID to SID+SSID. But given the familiarity of the former, I will
still use RID+PASID in v2 to ease the discussion;

- (Jason) v1 prevents one device from binding to multiple ioasid_fd's. This
will be fixed in v2;

- (Jean/Jason) No need to track guest I/O page tables on ARM/AMD. When
a pasid table is bound, it becomes a container for all guest I/O page tables;

- (Jean/Jason) Accordingly a device label is required so iotlb invalidation
and fault handling can both support per-device operation. Per Jean's
suggestion, this label will come from userspace (supplied with
VFIO_BIND_IOASID_FD);

- (Jason) Addition of the device label allows a per-device capability/format
check before IOASIDs are created. This leads to another major uAPI
change in v2: format info (mapping protocol, nesting, coherent, etc.) is
specified when creating an IOASID. The user is expected to check the
per-device format and then set a proper format for the IOASID based on
the to-be-attached device;

- (Jason/David) No restriction on map/unmap vs. bind/invalidate. They
can be used in either parent or child;

- (David) Change IOASID_GET_INFO to report permitted range instead of
reserved IOVA ranges. This works better for PPC;

- (Jason) For helper functions, expect to have explicit bus-type wrappers
e.g. ioasid_pci_device_attach;
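
The per-device format check mentioned in the list above could be modelled roughly as follows. This is only a toy Python sketch, not kernel code: `IoasidFormat`, `DeviceInfo`, `compatible` and `create_ioasid` are hypothetical names, and the three format fields simply mirror the "mapping protocol, nesting, coherent" wording of the plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoasidFormat:
    """Format requested at IOASID creation (field names are illustrative)."""
    mapping_protocol: str
    nesting: bool
    coherent: bool


@dataclass(frozen=True)
class DeviceInfo:
    """Per-device info reported to userspace (hypothetical layout)."""
    label: int          # device label supplied by userspace at bind time
    formats: frozenset  # IoasidFormat values this device can work with


def compatible(fmt, devices):
    """True if every to-be-attached device supports the requested format."""
    return all(fmt in dev.formats for dev in devices)


def create_ioasid(fmt, devices):
    """Refuse to create an IOASID whose format some device cannot use."""
    if not compatible(fmt, devices):
        raise ValueError("format not supported by all to-be-attached devices")
    return {"format": fmt, "attached": [dev.label for dev in devices]}
```

The point of the sketch is the ordering: the check runs against the devices first, and the IOASID is only created once a format acceptable to all of them is chosen.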

(Not adopted)

- (Parav) Make page pinning a syscall;
- (Jason W./Enrico) One I/O page table per fd;
- (David) Replace IOASID_REGISTER_MEMORY with another level of ioasid
nesting (a sort of passthrough mode). This needs more thought; v2 will
not change this part;

Thanks
Kevin


2021-06-09 13:55:16

by Eric Auger

Subject: Re: Plan for /dev/ioasid RFC v2

Hi Kevin,

On 6/7/21 4:58 AM, Tian, Kevin wrote:
> Hi, all,
>
> We plan to work on v2 now, given many good comments already received
> and substantial changes envisioned. This is a very complex topic with
> many sub-threads being discussed. To ensure that I didn't miss valuable
> suggestions (and also keep everyone on the same page), here I'd like to
> provide a list of planned changes in my mind. Please let me know if
> anything important is lost. :)
>
> --
>
> (Remaining opens in v1)
>
> - Protocol between kvm/vfio/ioasid for wbinvd/no-snoop. I'll see how
> much can be refined based on discussion progress when v2 is out;
>
> - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> being device-centric (but it's fine for vfio to be group-centric). A new
> section will be added to elaborate this part;
>
> - PASID virtualization (section 4) has not been thoroughly discussed yet.
> Jason gave some suggestion on how to categorize intended usages.
> I will rephrase this section and hope more discussions can be held for
> it in v2;
>
> (Adopted suggestions)
>
> - (Jason) Rename /dev/ioasid to /dev/iommu (so does uAPI e.g. IOASID
> _XXX to IOMMU_XXX). One suggestion (Jason) was to also rename
> RID+PASID to SID+SSID. But given the familiarity of the former, I will
> still use RID+PASID in v2 to ease the discussion;
>
> - (Jason) v1 prevents one device from binding to multiple ioasid_fd's. This
> will be fixed in v2;
>
> - (Jean/Jason) No need to track guest I/O page tables on ARM/AMD. When
> a pasid table is bound, it becomes a container for all guest I/O page tables;
While I am totally in line with that change, I guess we need to revisit
the invalidate ioctl to support PASID table invalidation.
>
> - (Jean/Jason) Accordingly a device label is required so iotlb invalidation
> and fault handling can both support per-device operation. Per Jean's
> suggestion, this label will come from userspace (when VFIO_BIND_
> IOASID_FD);

What is not totally clear to me is the correspondence between this label
and the SID/SSID tuple.
My understanding is that it rather maps to the SID, because you can attach
several ioasids to the device.
So it is not clear to me how you reconstruct the SSID info.

Thanks

Eric
>
> - (Jason) Addition of device label allows per-device capability/format
> check before IOASIDs are created. This leads to another major uAPI
> change in v2 - specify format info when creating an IOASID (mapping
> protocol, nesting, coherent, etc.). User is expected to check per-device
> format and then set proper format for IOASID upon to-be-attached
> device;

> - (Jason/David) No restriction on map/unmap vs. bind/invalidate. They
> can be used in either parent or child;
>
> - (David) Change IOASID_GET_INFO to report permitted range instead of
> reserved IOVA ranges. This works better for PPC;
>
> - (Jason) For helper functions, expect to have explicit bus-type wrappers
> e.g. ioasid_pci_device_attach;
>
> (Not adopted)
>
> - (Parav) Make page pinning a syscall;
> - (Jason. W/Enrico) one I/O page table per fd;
> - (David) Replace IOASID_REGISTER_MEMORY through another ioasid
> nesting (sort of passthrough mode). Need more thinking. v2 will not
> change this part;
>
> Thanks
> Kevin
>

2021-06-09 14:09:45

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Eric Auger <[email protected]>
> Sent: Wednesday, June 9, 2021 4:15 PM
>
> Hi Kevin,
>
> On 6/7/21 4:58 AM, Tian, Kevin wrote:
> > Hi, all,
> >
> > We plan to work on v2 now, given many good comments already received
> > and substantial changes envisioned. This is a very complex topic with
> > many sub-threads being discussed. To ensure that I didn't miss valuable
> > suggestions (and also keep everyone on the same page), here I'd like to
> > provide a list of planned changes in my mind. Please let me know if
> > anything important is lost. :)
> >
> > --
> >
> > (Remaining opens in v1)
> >
> > - Protocol between kvm/vfio/ioasid for wbinvd/no-snoop. I'll see how
> > much can be refined based on discussion progress when v2 is out;
> >
> > - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> > convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> > being device-centric (but it's fine for vfio to be group-centric). A new
> > section will be added to elaborate this part;
> >
> > - PASID virtualization (section 4) has not been thoroughly discussed yet.
> > Jason gave some suggestion on how to categorize intended usages.
> > I will rephrase this section and hope more discussions can be held for
> > it in v2;
> >
> > (Adopted suggestions)
> >
> > - (Jason) Rename /dev/ioasid to /dev/iommu (so does uAPI e.g. IOASID
> > _XXX to IOMMU_XXX). One suggestion (Jason) was to also rename
> > RID+PASID to SID+SSID. But given the familiarity of the former, I will
> > still use RID+PASID in v2 to ease the discussion;
> >
> > - (Jason) v1 prevents one device from binding to multiple ioasid_fd's. This
> > will be fixed in v2;
> >
> > - (Jean/Jason) No need to track guest I/O page tables on ARM/AMD. When
> > a pasid table is bound, it becomes a container for all guest I/O page tables;
> while I am totally in line with that change, I guess we need to revisit
> the invalidate ioctl
> to support PASID table invalidation.

Yes, this is planned when doing this change.

> >
> > - (Jean/Jason) Accordingly a device label is required so iotlb invalidation
> > and fault handling can both support per-device operation. Per Jean's
> > suggestion, this label will come from userspace (when VFIO_BIND_
> > IOASID_FD);
>
> what is not totally clear to me is the correspondence between this label
> and the SID/SSID tuple.
> My understanding is it rather maps to the SID because you can attach
> several ioasids to the device.
> So it is not clear to me how you reconstruct the SSID info
>

Yes, the device label maps to the SID. The fault data reported to userspace
will include {device_label, ioasid, vendor_fault_data}. In your case
I believe the SSID will be included in vendor_fault_data, so no reconstruction
is required. For Intel the user could figure out the vPASID from device_label
and ioasid, i.e. no need to include PASID info in vendor_fault_data.
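
As a rough illustration of the fault record described here, the following toy Python model shows the two lookup paths. Only the three field names come from the text above; the `FaultRecord` class, the `vpasid_table` mapping kept by userspace, and the `"ssid"` key inside `vendor_fault_data` are assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class FaultRecord:
    """Fault data reported to userspace, per the description above."""
    device_label: int
    ioasid: int
    vendor_fault_data: dict = field(default_factory=dict)


def vpasid_for(fault, vpasid_table):
    """Intel-style path: userspace derives the vPASID from (label, ioasid).

    `vpasid_table` is a hypothetical userspace-maintained mapping
    {(device_label, ioasid): vPASID}. Returns None if no entry exists.
    """
    return vpasid_table.get((fault.device_label, fault.ioasid))


def ssid_for(fault):
    """ARM-style path: the SSID travels inside the vendor data.

    The "ssid" key is an assumption; the thread does not define the layout.
    """
    return fault.vendor_fault_data.get("ssid")
```

In both cases the device label identifies the device (SID/RID), and nothing has to be reconstructed by the kernel beyond what the record already carries.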

Thanks
Kevin

2021-06-09 14:13:50

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Leon Romanovsky <[email protected]>
> Sent: Wednesday, June 9, 2021 5:02 PM
>
> On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
> > Hi, all,
>
> <...>
>
> > (Remaining opens in v1)
>
> <...>
>
> > - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> > convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> > being device-centric (but it's fine for vfio to be group-centric). A new
> > section will be added to elaborate this part;
>
> <...>
>
> > (Adopted suggestions)
>
> <...>
>
> > - (Jason) Addition of device label allows per-device capability/format
> > check before IOASIDs are created. This leads to another major uAPI
> > change in v2 - specify format info when creating an IOASID (mapping
> > protocol, nesting, coherent, etc.). User is expected to check per-device
> > format and then set proper format for IOASID upon to-be-attached
> > device;
>
> Sorry for my naive question, I still didn't read all v1 thread and maybe
> the answer is already written, but will ask anyway.
>
> Doesn't this adopted suggestion to allow device-specific configuration
> actually means that uAPI should be device-centric?
>
> User already needs to be aware of device, configure it explicitly, maybe
> gracefully clean it later, it looks like not so much left to be group-centric.
>

Yes, this is what v2 will lean toward. /dev/ioasid reports format info and
handles IOASID attachment per device. VFIO could still keep its group-centric
uAPI, but in the end it needs to bind each device in the group to the
IOASID FD one by one.

Thanks
Kevin

2021-06-09 14:21:43

by Eric Auger

Subject: Re: Plan for /dev/ioasid RFC v2

Hi Kevin,

On 6/9/21 11:37 AM, Tian, Kevin wrote:
>> From: Eric Auger <[email protected]>
>> Sent: Wednesday, June 9, 2021 4:15 PM
>>
>> Hi Kevin,
>>
>> On 6/7/21 4:58 AM, Tian, Kevin wrote:
>>> Hi, all,
>>>
>>> We plan to work on v2 now, given many good comments already received
>>> and substantial changes envisioned. This is a very complex topic with
>>> many sub-threads being discussed. To ensure that I didn't miss valuable
>>> suggestions (and also keep everyone on the same page), here I'd like to
>>> provide a list of planned changes in my mind. Please let me know if
>>> anything important is lost. :)
>>>
>>> --
>>>
>>> (Remaining opens in v1)
>>>
>>> - Protocol between kvm/vfio/ioasid for wbinvd/no-snoop. I'll see how
>>> much can be refined based on discussion progress when v2 is out;
>>>
>>> - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
>>> convinced yet. Based on discussion v2 will continue to have ioasid uAPI
>>> being device-centric (but it's fine for vfio to be group-centric). A new
>>> section will be added to elaborate this part;
>>>
>>> - PASID virtualization (section 4) has not been thoroughly discussed yet.
>>> Jason gave some suggestion on how to categorize intended usages.
>>> I will rephrase this section and hope more discussions can be held for
>>> it in v2;
>>>
>>> (Adopted suggestions)
>>>
>>> - (Jason) Rename /dev/ioasid to /dev/iommu (so does uAPI e.g. IOASID
>>> _XXX to IOMMU_XXX). One suggestion (Jason) was to also rename
>>> RID+PASID to SID+SSID. But given the familiarity of the former, I will
>>> still use RID+PASID in v2 to ease the discussion;
>>>
>>> - (Jason) v1 prevents one device from binding to multiple ioasid_fd's. This
>>> will be fixed in v2;
>>>
>>> - (Jean/Jason) No need to track guest I/O page tables on ARM/AMD. When
>>> a pasid table is bound, it becomes a container for all guest I/O page tables;
>> while I am totally in line with that change, I guess we need to revisit
>> the invalidate ioctl
>> to support PASID table invalidation.
> Yes, this is planned when doing this change.
OK
>
>>> - (Jean/Jason) Accordingly a device label is required so iotlb invalidation
>>> and fault handling can both support per-device operation. Per Jean's
>>> suggestion, this label will come from userspace (when VFIO_BIND_
>>> IOASID_FD);
>> what is not totally clear to me is the correspondence between this label
>> and the SID/SSID tuple.
>> My understanding is it rather maps to the SID because you can attach
>> several ioasids to the device.
>> So it is not clear to me how you reconstruct the SSID info
>>
> Yes, device handle maps to SID. The fault data reported to userspace
> will include {device_label, ioasid, vendor_fault_data}. In your case
> I believe SSID will be included in vendor_fault_data thus no reconstruct
> required. For Intel the user could figure out vPASID according to device_
> label and ioasid, i.e. no need to include PASID info in vendor_fault_data.
OK that works.

Thanks

Eric
>
> Thanks
> Kevin

2021-06-09 15:14:02

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, Jun 09, 2021 at 02:24:03PM +0200, Joerg Roedel wrote:
> On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
> > - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> > convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> > being device-centric (but it's fine for vfio to be group-centric). A new
> > section will be added to elaborate this part;
>
> I would vote for group-centric here. Or do the reasons for which VFIO is
> group-centric not apply to IOASID? If so, why?

VFIO being group centric has made it very ugly/difficult to inject
device driver specific knowledge into the scheme.

The device driver is the only thing that knows to ask:
- I need a SW table for this ioasid because I am like a mdev
- I will issue TLPs with PASID
- I need an IOASID linked to a PASID
- I am a device that uses ENQCMD and vPASID
- etc in future

The current approach has the group try to guess the device driver
intention in the vfio type 1 code.

I want to see this be clean and have the device driver directly tell
the iommu layer what kind of DMA it plans to do, and thus how it needs
the IOMMU and IOASID configured.

This is the source of the ugly symbol_get and the very, very hacky 'if
you are a mdev *and* a iommu then you must want a single PASID' stuff
in type1.

The group is causing all this mess because the group knows nothing
about what the device drivers contained in the group actually want.

Further being group centric eliminates the possibility of working in
cases like !ACS. How do I use PASID functionality of a device behind a
!ACS switch if the uAPI forces all IOASID's to be linked to a group,
not a device?

Device centric with a report that "all devices in the group must use
the same IOASID" covers all the new functionality, keeps the old, and
has a better chance to keep going as a uAPI into the future.

Jason

2021-06-09 15:56:38

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, Jun 09, 2021 at 03:32:34PM +0200, Joerg Roedel wrote:

> > The group is causing all this mess because the group knows nothing
> > about what the device drivers contained in the group actually want.
>
> There are devices in the group, not drivers.

Well exactly, that is the whole problem.

Only *drivers* know what the actual device is going to do, devices do
not. Since the group doesn't have drivers it is the wrong layer to be
making choices about how to configure the IOMMU.

As I've said, trying to cram these necessary choices through the group
has made a mess. I think if people want to keep the group then they need
to come up with a reasonable in-kernel API that gets the driver
involved in the required decisions, i.e. figure out how to do PASID
support on VFIO type1 that isn't grotesquely hardwired to mdev like
today.

The device centric approach is my attempt at this, and it is pretty
clean, I think.

> > Further being group centric eliminates the possibility of working in
> > cases like !ACS. How do I use PASID functionality of a device behind a
> > !ACS switch if the uAPI forces all IOASID's to be linked to a group,
> > not a device?
>
> You don't use it, because it is not secure for devices which are not
> behind an ACS bridge.

All ACS does is prevent P2P operations, if you assign all the group
devices into the same /dev/iommu then you may not care about that
security isolation property. At the very least it is policy for user
to decide, not kernel.

> > Device centric with a report that "all devices in the group must use
> > the same IOASID" covers all the new functionality, keeps the old, and
> > has a better chance to keep going as a uAPI into the future.
>
> If all devices in the group have to use the same IOASID anyway,

That isn't true! That is true *today* due to the API design but
nothing about the HW forces this, and with PASID it starts to become
problematic.

Groups should be primarily about isolation security, not about IOASID
matching.

Again, there is no reason to block PASID support in the vIOMMU if all
the devices in the group are assigned into the same VM, and the HW can
properly match the (RID,PASID). PASID can't transit a PCI-PCIe bridge,
PASID isn't supported by old IOMMUs that can't do RID matching, so
PASID scenarios should always be able to determine the source
regardless of what the group layout is.

Blocking this forever in the new uAPI just because group = IOASID is
some historical convenience makes no sense to me.

Jason

2021-06-09 16:37:59

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, 9 Jun 2021 17:51:26 +0200
Joerg Roedel <[email protected]> wrote:

> On Wed, Jun 09, 2021 at 12:00:09PM -0300, Jason Gunthorpe wrote:
> > Only *drivers* know what the actual device is going to do, devices do
> > not. Since the group doesn't have drivers it is the wrong layer to be
> > making choices about how to configure the IOMMU.
>
> Groups don't carry how to configure IOMMUs, that information is
> mostly in the IOMMU domains. And those (or an abstraction of them) is
> configured through /dev/ioasid. So not sure what you wanted to say with
> the above.
>
> All a group carries is information about which devices are not
> sufficiently isolated from each other and thus need to always be in the
> same domain.
>
> > The device centric approach is my attempt at this, and it is pretty
> > clean, I think.
>
> Clean, but still insecure.
>
> > All ACS does is prevent P2P operations, if you assign all the group
> > devices into the same /dev/iommu then you may not care about that
> > security isolation property. At the very least it is policy for user
> > to decide, not kernel.
>
> It is a kernel decision, because a fundamental task of the kernel is to
> ensure isolation between user-space tasks as good as it can. And if a
> device assigned to one task can interfere with a device of another task
> (e.g. by sending P2P messages), then the promise of isolation is broken.

AIUI, the IOASID model will still enforce IOMMU groups, but it's not an
explicit part of the interface like it is for vfio. For example the
IOASID model allows attaching individual devices such that we have
granularity to create per device IOASIDs, but all devices within an
IOMMU group are required to be attached to an IOASID before they can be
used. It's not entirely clear to me yet how that last bit gets
implemented though, ie. what barrier is in place to prevent device
usage prior to reaching this viable state.
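
To make the rule in this paragraph concrete, here is a minimal Python sketch of the "viable state" check only. It does not answer the open question of what barrier enforces it; `missing_devices` and `group_is_viable` are hypothetical helpers, and devices and IOASIDs are represented by plain strings and integers.

```python
def missing_devices(group, attachments):
    """Devices of the group that are not yet attached to any IOASID.

    `group` is an iterable of device names; `attachments` maps a device
    name to the IOASID it is attached to (per-device granularity).
    """
    return sorted(dev for dev in group if dev not in attachments)


def group_is_viable(group, attachments):
    """The group may be used only once every member has an IOASID.

    Members are free to use different IOASIDs; only attachment of all
    of them is required.
    """
    return not missing_devices(group, attachments)
```

For example, with a two-function group `["fn0", "fn1"]`, attaching only `fn0` leaves the group non-viable, while attaching `fn0` to IOASID 1 and `fn1` to IOASID 2 makes it viable even though the IOASIDs differ.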

> > Groups should be primarily about isolation security, not about IOASID
> > matching.
>
> That doesn't make any sense, what do you mean by 'IOASID matching'?

One of the problems with the vfio interface use of groups is that we
conflate the IOMMU group for both isolation and granularity. I think
what Jason is referring to here is that we still want groups to be the
basis of isolation, but we don't want a uAPI that presumes all devices
within the group must use the same IOASID. For example, if a user owns
an IOMMU group consisting of non-isolated functions of a multi-function
device, they should be able to create a vIOMMU VM where each of those
functions has its own address space. That can't be done today, the
entire group would need to be attached to the VM under a PCIe-to-PCI
bridge to reflect the address space limitation imposed by the vfio
group uAPI model. Thanks,

Alex

2021-06-09 16:39:46

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, 9 Jun 2021 10:15:32 -0600
Alex Williamson <[email protected]> wrote:

> On Wed, 9 Jun 2021 17:51:26 +0200
> Joerg Roedel <[email protected]> wrote:
>
> > On Wed, Jun 09, 2021 at 12:00:09PM -0300, Jason Gunthorpe wrote:
> > > Only *drivers* know what the actual device is going to do, devices do
> > > not. Since the group doesn't have drivers it is the wrong layer to be
> > > making choices about how to configure the IOMMU.
> >
> > Groups don't carry how to configure IOMMUs, that information is
> > mostly in the IOMMU domains. And those (or an abstraction of them) is
> > configured through /dev/ioasid. So not sure what you wanted to say with
> > the above.
> >
> > All a group carries is information about which devices are not
> > sufficiently isolated from each other and thus need to always be in the
> > same domain.
> >
> > > The device centric approach is my attempt at this, and it is pretty
> > > clean, I think.
> >
> > Clean, but still insecure.
> >
> > > All ACS does is prevent P2P operations, if you assign all the group
> > > devices into the same /dev/iommu then you may not care about that
> > > security isolation property. At the very least it is policy for user
> > > to decide, not kernel.
> >
> > It is a kernel decision, because a fundamental task of the kernel is to
> > ensure isolation between user-space tasks as good as it can. And if a
> > device assigned to one task can interfere with a device of another task
> > (e.g. by sending P2P messages), then the promise of isolation is broken.
>
> AIUI, the IOASID model will still enforce IOMMU groups, but it's not an
> explicit part of the interface like it is for vfio. For example the
> IOASID model allows attaching individual devices such that we have
> granularity to create per device IOASIDs, but all devices within an
> IOMMU group are required to be attached to an IOASID before they can be
> used. It's not entirely clear to me yet how that last bit gets
> implemented though, ie. what barrier is in place to prevent device
> usage prior to reaching this viable state.
>
> > > Groups should be primarily about isolation security, not about IOASID
> > > matching.
> >
> > That doesn't make any sense, what do you mean by 'IOASID matching'?
>
> One of the problems with the vfio interface use of groups is that we
> conflate the IOMMU group for both isolation and granularity. I think
> what Jason is referring to here is that we still want groups to be the
> basis of isolation, but we don't want a uAPI that presumes all devices
> within the group must use the same IOASID. For example, if a user owns
> an IOMMU group consisting of non-isolated functions of a multi-function
> device, they should be able to create a vIOMMU VM where each of those
> functions has its own address space. That can't be done today, the
> entire group would need to be attached to the VM under a PCIe-to-PCI
> bridge to reflect the address space limitation imposed by the vfio
> group uAPI model. Thanks,

Hmm, likely discussed previously in these threads, but I can't come up
with the argument that prevents us from making the BIND interface
at the group level but the ATTACH interface at the device level? For
example:

- VFIO_GROUP_BIND_IOASID_FD
- VFIO_DEVICE_ATTACH_IOASID

AFAICT that makes the group ownership more explicit but still allows
the device level IOASID granularity. Logically this is just an
internal iommu_group_for_each_dev() in the BIND ioctl. Thanks,

Alex
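
The split proposed in this message (BIND at group level, ATTACH at device level) might look like the following toy Python model. It is a sketch under stated assumptions: the `IoasidFd` class and both function names are hypothetical stand-ins for the two ioctls, and the plain loop stands in for an internal iommu_group_for_each_dev().

```python
class IoasidFd:
    """Toy stand-in for an open IOASID fd (hypothetical structure)."""

    def __init__(self):
        self.bound = set()    # devices bound via a group-level BIND
        self.attached = {}    # device -> ioasid, set by device-level ATTACH


def group_bind_ioasid_fd(group, ioasid_fd):
    """Models VFIO_GROUP_BIND_IOASID_FD: bind every device in the group."""
    for dev in group:  # stand-in for iommu_group_for_each_dev()
        ioasid_fd.bound.add(dev)


def device_attach_ioasid(dev, ioasid_fd, ioasid):
    """Models VFIO_DEVICE_ATTACH_IOASID: per-device IOASID choice."""
    if dev not in ioasid_fd.bound:
        raise PermissionError(f"{dev} was not bound through its group")
    ioasid_fd.attached[dev] = ioasid
```

Ownership is established once for the whole group, while each device can still be attached to its own IOASID afterwards.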

2021-06-09 17:15:21

by Leon Romanovsky

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
> Hi, all,

<...>

> (Remaining opens in v1)

<...>

> - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> being device-centric (but it's fine for vfio to be group-centric). A new
> section will be added to elaborate this part;

<...>

> (Adopted suggestions)

<...>

> - (Jason) Addition of device label allows per-device capability/format
> check before IOASIDs are created. This leads to another major uAPI
> change in v2 - specify format info when creating an IOASID (mapping
> protocol, nesting, coherent, etc.). User is expected to check per-device
> format and then set proper format for IOASID upon to-be-attached
> device;

Sorry for my naive question, I still didn't read all v1 thread and maybe
the answer is already written, but will ask anyway.

Doesn't this adopted suggestion to allow device-specific configuration
actually means that uAPI should be device-centric?

User already needs to be aware of device, configure it explicitly, maybe
gracefully clean it later, it looks like not so much left to be group-centric.

Thanks

2021-06-09 17:24:28

by Joerg Roedel

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
> - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> being device-centric (but it's fine for vfio to be group-centric). A new
> section will be added to elaborate this part;

I would vote for group-centric here. Or do the reasons for which VFIO is
group-centric not apply to IOASID? If so, why?

Regards,

Joerg

2021-06-09 18:09:31

by Joerg Roedel

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, Jun 09, 2021 at 12:00:09PM -0300, Jason Gunthorpe wrote:
> Only *drivers* know what the actual device is going to do, devices do
> not. Since the group doesn't have drivers it is the wrong layer to be
> making choices about how to configure the IOMMU.

Groups don't carry how to configure IOMMUs, that information is
mostly in the IOMMU domains. And those (or an abstraction of them) is
configured through /dev/ioasid. So not sure what you wanted to say with
the above.

All a group carries is information about which devices are not
sufficiently isolated from each other and thus need to always be in the
same domain.

> The device centric approach is my attempt at this, and it is pretty
> clean, I think.

Clean, but still insecure.

> All ACS does is prevent P2P operations, if you assign all the group
> devices into the same /dev/iommu then you may not care about that
> security isolation property. At the very least it is policy for user
> to decide, not kernel.

It is a kernel decision, because a fundamental task of the kernel is to
ensure isolation between user-space tasks as well as it can. And if a
device assigned to one task can interfere with a device of another task
(e.g. by sending P2P messages), then the promise of isolation is broken.

> Groups should be primarily about isolation security, not about IOASID
> matching.

That doesn't make any sense, what do you mean by 'IOASID matching'?

> Blocking this forever in the new uAPI just because group = IOASID is
> some historical convenience makes no sense to me.

I think it is safe to assume that devices supporting PASID will most
often be the only ones in their group. But for the non-PASID IOASID
use-cases like plain old device assignment to a VM it needs to be
group-centric.

Regards,

Joerg

2021-06-09 18:45:46

by Joerg Roedel

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, Jun 09, 2021 at 09:39:19AM -0300, Jason Gunthorpe wrote:
> VFIO being group centric has made it very ugly/difficult to inject
> device driver specific knowledge into the scheme.

This whole API will be complicated and difficult anyway, so no reason to
unnecessarily simplify things here.

VFIO is group-centric for security/isolation reasons, and since IOASID
is a uAPI it also needs to account for that. In the end the devices
which are going to use this will likely have their own group anyway, so
things will not get too complicated.

> The current approach has the group try to guess the device driver
> intention in the vfio type 1 code.
>
> I want to see this be clean and have the device driver directly tell
> the iommu layer what kind of DMA it plans to do, and thus how it needs
> the IOMMU and IOASID configured.

I am in for the general idea, it simplifies the code. But the kernel
still needs to check whether the wishlist from user-space can be
fulfilled.

> The group is causing all this mess because the group knows nothing
> about what the device drivers contained in the group actually want.

There are devices in the group, not drivers.

> Further being group centric eliminates the possibility of working in
> cases like !ACS. How do I use PASID functionality of a device behind a
> !ACS switch if the uAPI forces all IOASID's to be linked to a group,
> not a device?

You don't use it, because it is not secure for devices which are not
behind an ACS bridge.

> Device centric with a report that "all devices in the group must use
> the same IOASID" covers all the new functionality, keeps the old, and
> has a better chance to keep going as a uAPI into the future.

If all devices in the group have to use the same IOASID anyway, we can
just as well force it by making the interface group-centric.

Regards,

Joerg

2021-06-09 18:51:17

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, Jun 09, 2021 at 10:27:22AM -0600, Alex Williamson wrote:

> > > It is a kernel decision, because a fundamental task of the kernel is to
> > > ensure isolation between user-space tasks as good as it can. And if a
> > > device assigned to one task can interfere with a device of another task
> > > (e.g. by sending P2P messages), then the promise of isolation is broken.
> >
> > AIUI, the IOASID model will still enforce IOMMU groups, but it's not an
> > explicit part of the interface like it is for vfio. For example the
> > IOASID model allows attaching individual devices such that we have
> > granularity to create per device IOASIDs, but all devices within an
> > IOMMU group are required to be attached to an IOASID before they can be
> > used.

Yes, thanks Alex

> > It's not entirely clear to me yet how that last bit gets
> > implemented though, ie. what barrier is in place to prevent device
> > usage prior to reaching this viable state.

The major security checkpoint for the group is on the VFIO side. We
must require the group before userspace can be allowed access to any
device registers. Obtaining the device_fd from the group_fd does this
today as the group_fd is the security proof.

Actually, thinking about this some more.. If the only way to get a
working device_fd in the first place is to get it from the group_fd
and thus pass a group-based security check, why do we need to do
anything at the ioasid level?

The security concept of isolation was satisfied as soon as userspace
opened the group_fd. What do more checks in the kernel accomplish?

Yes, we have the issue where some groups require all devices to use
the same IOASID, but once someone has the group_fd that is no longer a
security issue. We can fail VFIO_DEVICE_ATTACH_IOASID calls that
don't make sense.

> > > > Groups should be primarily about isolation security, not about IOASID
> > > > matching.
> > >
> > > That doesn't make any sense, what do you mean by 'IOASID matching'?
> >
> > One of the problems with the vfio interface use of groups is that we
> > conflate the IOMMU group for both isolation and granularity. I think
> > what Jason is referring to here is that we still want groups to be the
> > basis of isolation, but we don't want a uAPI that presumes all devices
> > within the group must use the same IOASID.

Yes, thanks again Alex

> > For example, if a user owns an IOMMU group consisting of
> > non-isolated functions of a multi-function device, they should be
> > able to create a vIOMMU VM where each of those functions has its
> > own address space. That can't be done today, the entire group
> > would need to be attached to the VM under a PCIe-to-PCI bridge to
> > reflect the address space limitation imposed by the vfio group
> > uAPI model. Thanks,
>
> Hmm, likely discussed previously in these threads, but I can't come up
> with the argument that prevents us from making the BIND interface
> at the group level but the ATTACH interface at the device level? For
> example:
>
> - VFIO_GROUP_BIND_IOASID_FD
> - VFIO_DEVICE_ATTACH_IOASID
>
> AFAICT that makes the group ownership more explicit but still allows
> the device level IOASID granularity. Logically this is just an
> internal iommu_group_for_each_dev() in the BIND ioctl. Thanks,

At a high level it sounds OK.

However I think your above question needs to be answered - what do we
want to enforce on the iommu_fd and why?

Also, this creates a problem with the device label idea, we still
need to associate each device_fd with a label, so your above sequence
is probably:

VFIO_GROUP_BIND_IOASID_FD(group fd)
VFIO_BIND_IOASID_FD(device fd 1, device_label)
VFIO_BIND_IOASID_FD(device fd 2, device_label)
VFIO_DEVICE_ATTACH_IOASID(..)

And then I think we are back to where I had started, we can trigger
whatever VFIO_GROUP_BIND_IOASID_FD does automatically as soon as all
of the devices in the group have been bound.
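
A minimal sketch of that idea, written as a userspace model rather than kernel code: each per-device bind records the device, and the group-level step fires implicitly on the last one. Every name here (`model_group`, `model_bind_device`, `model_attach`, `MAX_GROUP_DEVS`) is hypothetical and only illustrates the ordering, not any posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: all names are illustrative. */
#define MAX_GROUP_DEVS 8

struct model_group {
	size_t ndevs;                /* devices in the iommu group */
	bool bound[MAX_GROUP_DEVS];  /* device i bound to the iommu fd? */
	bool group_bound;            /* implicit group-level bind done? */
};

/* True once every device in the group has been bound. */
bool model_all_bound(const struct model_group *g)
{
	for (size_t i = 0; i < g->ndevs; i++)
		if (!g->bound[i])
			return false;
	return g->ndevs > 0;
}

/*
 * Models VFIO_BIND_IOASID_FD(device fd, label) for device 'idx'.
 * The group-level step is triggered automatically by the last bind.
 * Returns 0 on success, -1 on a bad index.
 */
int model_bind_device(struct model_group *g, size_t idx)
{
	assert(g != NULL);
	if (g->ndevs > MAX_GROUP_DEVS || idx >= g->ndevs)
		return -1;
	g->bound[idx] = true;
	if (model_all_bound(g))
		g->group_bound = true;
	return 0;
}

/* Models VFIO_DEVICE_ATTACH_IOASID: refused until the group is bound. */
int model_attach(const struct model_group *g, size_t idx)
{
	assert(g != NULL);
	if (idx >= g->ndevs || !g->bound[idx] || !g->group_bound)
		return -1;
	return 0;
}
```

With a two-device group, attaching device 0 fails after only the first bind and succeeds once device 1 has also been bound.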

Jason

2021-06-10 05:55:43

by Lu Baolu

Subject: Re: Plan for /dev/ioasid RFC v2

On 6/9/21 8:39 PM, Jason Gunthorpe wrote:
> On Wed, Jun 09, 2021 at 02:24:03PM +0200, Joerg Roedel wrote:
>> On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
>>> - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
>>> convinced yet. Based on discussion v2 will continue to have ioasid uAPI
>>> being device-centric (but it's fine for vfio to be group-centric). A new
>>> section will be added to elaborate this part;
>> I would vote for group-centric here. Or do the reasons for which VFIO is
>> group-centric not apply to IOASID? If so, why?
> VFIO being group centric has made it very ugly/difficult to inject
> device driver specific knowledge into the scheme.
>
> The device driver is the only thing that knows to ask:
> - I need a SW table for this ioasid because I am like a mdev
> - I will issue TLPs with PASID
> - I need an IOASID linked to a PASID
> - I am a device that uses ENQCMD and vPASID
> - etc in future
>
> The current approach has the group try to guess the device driver
> intention in the vfio type 1 code.
>
> I want to see this be clean and have the device driver directly tell
> the iommu layer what kind of DMA it plans to do, and thus how it needs
> the IOMMU and IOASID configured.
>
> This is the source of the ugly symbol_get and the very, very hacky 'if
> you are a mdev *and* a iommu then you must want a single PASID' stuff
> in type1.
>
> The group is causing all this mess because the group knows nothing
> about what the device drivers contained in the group actually want.
>
> Further being group centric eliminates the possibility of working in
> cases like !ACS. How do I use PASID functionality of a device behind a
> !ACS switch if the uAPI forces all IOASID's to be linked to a group,
> not a device?
>
> Device centric with a report that "all devices in the group must use
> the same IOASID" covers all the new functionality, keep the old, and
> has a better chance to keep going as a uAPI into the future.

The iommu_group can guarantee the isolation among different physical
devices (represented by RIDs). But when it comes to sub-devices (ex.
mdev or vDPA devices represented by RID + SSID), we have to rely on the
device driver for isolation. The devices which are able to generate sub-
devices should either use their own on-device mechanisms or use the
platform features like Intel Scalable IOV to isolate the sub-devices.

Under the above conditions, different sub-devices of the same RID device
can use different IOASIDs. This seems to mean that we can't
support mixed mode where, for example, two RIDs share an iommu_group and
one (or both) of them have sub-devices.

AIUI, when we attach a "RID + SSID" to an IOASID, we should require that
the RID doesn't share the iommu_group with any other RID.
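
A rough sketch of that rule as a standalone check, assuming a toy group that only counts its RIDs. The names `model_iommu_group` and `model_may_attach` are hypothetical, and the policy for plain RID attaches is deliberately left permissive here since it is decided elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an iommu group: it only counts its RIDs. */
struct model_iommu_group {
	size_t nr_rids;  /* physical devices (RIDs) in the group */
};

/*
 * Attach policy sketched from the paragraph above:
 * - a plain RID attach is not restricted by this check
 *   (group-wide IOASID policy is assumed to be handled elsewhere);
 * - a "RID + SSID" (sub-device) attach requires that the RID is
 *   alone in its iommu group.
 */
bool model_may_attach(const struct model_iommu_group *grp, bool has_ssid)
{
	assert(grp != NULL);
	if (!has_ssid)
		return true;
	return grp->nr_rids == 1;
}
```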

Best regards,
baolu

2021-06-10 15:42:12

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, 9 Jun 2021 15:49:40 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 09, 2021 at 10:27:22AM -0600, Alex Williamson wrote:
>
> > > > It is a kernel decision, because a fundamental task of the kernel is to
> > > > ensure isolation between user-space tasks as good as it can. And if a
> > > > device assigned to one task can interfere with a device of another task
> > > > (e.g. by sending P2P messages), then the promise of isolation is broken.
> > >
> > > AIUI, the IOASID model will still enforce IOMMU groups, but it's not an
> > > explicit part of the interface like it is for vfio. For example the
> > > IOASID model allows attaching individual devices such that we have
> > > granularity to create per device IOASIDs, but all devices within an
> > > IOMMU group are required to be attached to an IOASID before they can be
> > > used.
>
> Yes, thanks Alex
>
> > > It's not entirely clear to me yet how that last bit gets
> > > implemented though, ie. what barrier is in place to prevent device
> > > usage prior to reaching this viable state.
>
> The major security checkpoint for the group is on the VFIO side. We
> must require the group before userspace can be allowed access to any
> device registers. Obtaining the device_fd from the group_fd does this
> today as the group_fd is the security proof.
>
> Actually, thinking about this some more.. If the only way to get a
> working device_fd in the first place is to get it from the group_fd
> and thus pass a group-based security check, why do we need to do
> anything at the ioasid level?
>
> The security concept of isolation was satisfied as soon as userspace
> opened the group_fd. What do more checks in the kernel accomplish?

Opening the group is not the extent of the security check currently
required, the group must be added to a container and an IOMMU model
configured for the container *before* the user can get a devicefd.
Each devicefd creates a reference to this security context, therefore
access to a device does not exist without such a context.

This proposal has of course put the device before the group, which then
makes it more difficult for vfio to retroactively enforce security.

> Yes, we have the issue where some groups require all devices to use
> the same IOASID, but once someone has the group_fd that is no longer a
> security issue. We can fail VFIO_DEVICE_ATTACH_IOASID calls that
> don't make sense.

The groupfd only proves the user has an ownership claim to the devices,
it does not itself prove that the devices are in an isolated context.
Device access is not granted until that isolated context is configured.

vfio owns the device, so it would make sense for vfio to enforce the
security of device access only in a secure context, but how do we know
a device is in a secure context?

Is it sufficient to track the vfio device ioctls for attach/detach for
an IOASID or will the user be able to manipulate IOASID configuration
for a device directly via the IOASIDfd?

What happens on detach? As we've discussed elsewhere in this thread,
revoking access is more difficult than holding a reference to the
secure context, but I'm under the impression that moving a device
between IOASIDs could be standard practice in this new model. A device
that's detached from a secure context, even temporarily, is a problem.
Access to other devices in the same group as a device detached from a
secure context is a problem.

> > > > > Groups should be primarily about isolation security, not about IOASID
> > > > > matching.
> > > >
> > > > That doesn't make any sense, what do you mean by 'IOASID matching'?
> > >
> > > One of the problems with the vfio interface use of groups is that we
> > > conflate the IOMMU group for both isolation and granularity. I think
> > > what Jason is referring to here is that we still want groups to be the
> > > basis of isolation, but we don't want a uAPI that presumes all devices
> > > within the group must use the same IOASID.
>
> Yes, thanks again Alex
>
> > > For example, if a user owns an IOMMU group consisting of
> > > non-isolated functions of a multi-function device, they should be
> > > able to create a vIOMMU VM where each of those functions has its
> > > own address space. That can't be done today, the entire group
> > > would need to be attached to the VM under a PCIe-to-PCI bridge to
> > > reflect the address space limitation imposed by the vfio group
> > > uAPI model. Thanks,
> >
> > Hmm, likely discussed previously in these threads, but I can't come up
> > with the argument that prevents us from making the BIND interface
> > at the group level but the ATTACH interface at the device level? For
> > example:
> >
> > - VFIO_GROUP_BIND_IOASID_FD
> > - VFIO_DEVICE_ATTACH_IOASID
> >
> > AFAICT that makes the group ownership more explicit but still allows
> > the device level IOASID granularity. Logically this is just an
> > internal iommu_group_for_each_dev() in the BIND ioctl. Thanks,
>
> At a high level it sounds OK.
>
> However I think your above question needs to be answered - what do we
> want to enforce on the iommu_fd and why?
>
> Also, this creates a problem with the device label idea, we still
> need to associate each device_fd with a label, so your above sequence
> is probably:
>
> VFIO_GROUP_BIND_IOASID_FD(group fd)
> VFIO_BIND_IOASID_FD(device fd 1, device_label)
> VFIO_BIND_IOASID_FD(device fd 2, device_label)
> VFIO_DEVICE_ATTACH_IOASID(..)
>
> And then I think we are back to where I had started, we can trigger
> whatever VFIO_GROUP_BIND_IOASID_FD does automatically as soon as all
> of the devices in the group have been bound.

How to label a device seems like a relatively mundane issue relative to
ownership and isolated contexts of groups and devices. The label is
essentially just creating an identifier to device mapping, where the
identifier (label) will be used in the IOASID interface, right? As I
note above, that makes it difficult for vfio to maintain that a user
only accesses a device in a secure context. This is exactly why vfio
has the model of getting a devicefd from a groupfd only when that group
is in a secure context and maintaining references to that secure
context for each device. Split ownership of the secure context in
IOASID vs device access in vfio and exposing devicefds outside the group
is still a big question mark for me. Thanks,

Alex

2021-06-11 00:59:59

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

Hi, Alex,

> From: Alex Williamson <[email protected]>
> Sent: Thursday, June 10, 2021 11:39 PM
>
> On Wed, 9 Jun 2021 15:49:40 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Wed, Jun 09, 2021 at 10:27:22AM -0600, Alex Williamson wrote:
> >
> > > > > It is a kernel decision, because a fundamental task of the kernel is to
> > > > > ensure isolation between user-space tasks as good as it can. And if a
> > > > > > device assigned to one task can interfere with a device of another task
> > > > > > (e.g. by sending P2P messages), then the promise of isolation is
> > > > > > broken.
> > > >
> > > > AIUI, the IOASID model will still enforce IOMMU groups, but it's not an
> > > > explicit part of the interface like it is for vfio. For example the
> > > > IOASID model allows attaching individual devices such that we have
> > > > granularity to create per device IOASIDs, but all devices within an
> > > > > IOMMU group are required to be attached to an IOASID before they can
> > > > > be used.
> >
> > Yes, thanks Alex
> >
> > > > It's not entirely clear to me yet how that last bit gets
> > > > implemented though, ie. what barrier is in place to prevent device
> > > > usage prior to reaching this viable state.
> >
> > The major security checkpoint for the group is on the VFIO side. We
> > must require the group before userspace can be allowed access to any
> > device registers. Obtaining the device_fd from the group_fd does this
> > today as the group_fd is the security proof.
> >
> > Actually, thinking about this some more.. If the only way to get a
> > working device_fd in the first place is to get it from the group_fd
> > and thus pass a group-based security check, why do we need to do
> > anything at the ioasid level?
> >
> > The security concept of isolation was satisfied as soon as userspace
> > opened the group_fd. What do more checks in the kernel accomplish?
>
> Opening the group is not the extent of the security check currently
> required, the group must be added to a container and an IOMMU model
> configured for the container *before* the user can get a devicefd.
> Each devicefd creates a reference to this security context, therefore
> access to a device does not exist without such a context.

IIUC each device has a default domain when it's probed by the iommu driver
at boot time. This domain includes an empty page table, implying that the
device is already in a security context before it's probed by the device driver.

Now when this device is added to vfio, vfio creates another security
context through above sequence. This sequence requires the device to
switch from default security context to this new one, before it can be
accessed by user.

Then I wonder whether it's really necessary. As long as a device is in
a security context at any time, access to a device can be allowed. The
user itself should ensure that the access happens only after the device
creates a reference to the new security context that is desired by this
user.

Then what does group really bring to us?

With this new proposal we just need to make sure that a device cannot
be attached to any IOASID before all devices in its group are bound to
the IOASIDfd. If we want to start with a vfio-like policy, then all devices
in the group must be attached to the same IOASID. Or as Jason suggests,
they can attach to different IOASIDs (if in the group due to !ACS) if the
user wants, or have some devices attached while others detached since
both are in a security context anyway.

>
> This proposal has of course put the device before the group, which then
> makes it more difficult for vfio to retroactively enforce security.
>
> > Yes, we have the issue where some groups require all devices to use
> > the same IOASID, but once someone has the group_fd that is no longer a
> > security issue. We can fail VFIO_DEVICE_ATTACH_IOASID calls that
> > don't make sense.
>
> The groupfd only proves the user has an ownership claim to the devices,
> it does not itself prove that the devices are in an isolated context.
> Device access is not granted until that isolated context is configured.
>
> vfio owns the device, so it would make sense for vfio to enforce the
> security of device access only in a secure context, but how do we know
> a device is in a secure context?
>
> Is it sufficient to track the vfio device ioctls for attach/detach for
> an IOASID or will the user be able to manipulate IOASID configuration
> for a device directly via the IOASIDfd?
>
> What happens on detach? As we've discussed elsewhere in this thread,
> revoking access is more difficult than holding a reference to the
> secure context, but I'm under the impression that moving a device
> between IOASIDs could be standard practice in this new model. A device
> that's detached from a secure context, even temporarily, is a problem.
> Access to other devices in the same group as a device detached from a
> secure context is a problem.

As long as the device is switched back to the default security context
after detach, it should be fine.

>
> > > > > > Groups should be primarily about isolation security, not about
> > > > > > IOASID matching.
> > > > >
> > > > > That doesn't make any sense, what do you mean by 'IOASID matching'?
> > > >
> > > > One of the problems with the vfio interface use of groups is that we
> > > > conflate the IOMMU group for both isolation and granularity. I think
> > > > what Jason is referring to here is that we still want groups to be the
> > > > basis of isolation, but we don't want a uAPI that presumes all devices
> > > > within the group must use the same IOASID.
> >
> > Yes, thanks again Alex
> >
> > > > For example, if a user owns an IOMMU group consisting of
> > > > non-isolated functions of a multi-function device, they should be
> > > > able to create a vIOMMU VM where each of those functions has its
> > > > own address space. That can't be done today, the entire group
> > > > would need to be attached to the VM under a PCIe-to-PCI bridge to
> > > > reflect the address space limitation imposed by the vfio group
> > > > uAPI model. Thanks,
> > >
> > > Hmm, likely discussed previously in these threads, but I can't come up
> > > with the argument that prevents us from making the BIND interface
> > > at the group level but the ATTACH interface at the device level? For
> > > example:
> > >
> > > - VFIO_GROUP_BIND_IOASID_FD
> > > - VFIO_DEVICE_ATTACH_IOASID
> > >
> > > AFAICT that makes the group ownership more explicit but still allows
> > > the device level IOASID granularity. Logically this is just an
> > > internal iommu_group_for_each_dev() in the BIND ioctl. Thanks,
> >
> > At a high level it sounds OK.
> >
> > However I think your above question needs to be answered - what do we
> > want to enforce on the iommu_fd and why?
> >
> > Also, this creates a problem with the device label idea, we still
> > need to associate each device_fd with a label, so your above sequence
> > is probably:
> >
> > VFIO_GROUP_BIND_IOASID_FD(group fd)
> > VFIO_BIND_IOASID_FD(device fd 1, device_label)
> > VFIO_BIND_IOASID_FD(device fd 2, device_label)
> > VFIO_DEVICE_ATTACH_IOASID(..)
> >
> > And then I think we are back to where I had started, we can trigger
> > whatever VFIO_GROUP_BIND_IOASID_FD does automatically as soon as all
> > of the devices in the group have been bound.
>
> How to label a device seems like a relatively mundane issue relative to
> ownership and isolated contexts of groups and devices. The label is
> essentially just creating an identifier to device mapping, where the
> identifier (label) will be used in the IOASID interface, right? As I

Three usages in v2:

1) when reporting per-device capability/format info to user;
2) when handling device-wide iotlb invalidation from user;
3) when reporting device-specific fault data to user;

> note above, that makes it difficult for vfio to maintain that a user
> only accesses a device in a secure context. This is exactly why vfio
> has the model of getting a devicefd from a groupfd only when that group
> is in a secure context and maintaining references to that secure
> context for each device. Split ownership of the secure context in
> IOASID vs device access in vfio and exposing devicefds outside the group
> is still a big question mark for me. Thanks,
>

Thanks
Kevin

2021-06-11 16:49:04

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 10, 2021 at 09:38:42AM -0600, Alex Williamson wrote:

> Opening the group is not the extent of the security check currently
> required, the group must be added to a container and an IOMMU model
> configured for the container *before* the user can get a devicefd.
> Each devicefd creates a reference to this security context, therefore
> access to a device does not exist without such a context.

Okay, I missed that detail in the organization..

So, if we have an independent vfio device fd then it needs to be
kept disabled until the user joins it to an ioasid that provides the
security proof to allow it to work?

> What happens on detach? As we've discussed elsewhere in this thread,
> revoking access is more difficult than holding a reference to the
> secure context, but I'm under the impression that moving a device
> between IOASIDs could be standard practice in this new model. A device
> that's detached from a secure context, even temporarily, is a
> problem.

This is why I think the single iommu FD is critical, it is the FD, not
the IOASID that has to authorize the security. You shouldn't move
devices between FDs, but you can move them between IOASIDs inside the
same FD.

> How to label a device seems like a relatively mundane issue relative to
> ownership and isolated contexts of groups and devices. The label is
> essentially just creating an identifier to device mapping, where the
> identifier (label) will be used in the IOASID interface, right?

It looks that way

> As I note above, that makes it difficult for vfio to maintain that a
> user only accesses a device in a secure context. This is exactly
> why vfio has the model of getting a devicefd from a groupfd only
> when that group is in a secure context and maintaining references to
> that secure context for each device. Split ownership of the secure
> context in IOASID vs device access in vfio and exposing devicefds
> outside the group is still a big question mark for me. Thanks,

I think the protection model becomes different once we allow
individual devices inside a group to be attached to different
IOASID's.

Now we just want some general authorization that the user is allowed
to operate the device_fd.

To keep a fairly similar model to the way vfio does things today..

- The device_fd is single open, so only one fd exists globally

- Upon first joining the iommu_fd the group is obtained inside
the iommu_fd. This is only possible if no other iommu_fd has
obtained the group

- If the group can not be obtained then the device_fd is left
inoperable and cannot control the device

- If multiple devices in the same group are joined then they all
refcount the group

It is simple, and gives semantics similar to VFIO with the notable
difference that a process can obtain a device FD, it is just inoperable
until the iommu_fd is attached.

Removal is OK: if you remove the device_fd from the iommu_fd (only
allowed by closing it) then a newly opened FD is inoperable.
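
The ownership rules listed above could be modelled roughly as follows. This is only a userspace sketch of the proposal as stated, with hypothetical names throughout (`model_iommu_fd`, `model_group`, `model_device_fd`, `model_join`, `model_close`, `model_operable`); it does not model IOASID attachment or DMA isolation at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fds discussed above. */
struct model_iommu_fd { int id; };

struct model_group {
	const struct model_iommu_fd *owner;  /* NULL while unclaimed */
	unsigned int refs;                   /* joined devices in the group */
};

struct model_device_fd {
	struct model_group *group;
	const struct model_iommu_fd *joined; /* NULL until joined */
};

/*
 * Join a device_fd to an iommu_fd. The first join claims the group
 * for that iommu_fd; later joins from the same iommu_fd take another
 * reference. Returns 0 on success, -1 if the group is held by a
 * different iommu_fd or the device is already joined.
 */
int model_join(struct model_device_fd *dev, const struct model_iommu_fd *ifd)
{
	assert(dev != NULL && dev->group != NULL && ifd != NULL);
	struct model_group *g = dev->group;

	if (dev->joined != NULL)
		return -1;
	if (g->owner != NULL && g->owner != ifd)
		return -1;
	g->owner = ifd;
	g->refs++;
	dev->joined = ifd;
	return 0;
}

/* Models closing the device_fd: drop the ref, release on the last one. */
void model_close(struct model_device_fd *dev)
{
	assert(dev != NULL && dev->group != NULL);
	if (dev->joined == NULL)
		return;
	dev->joined = NULL;
	if (--dev->group->refs == 0)
		dev->group->owner = NULL;
}

/* A device_fd that never joined (or was reopened) stays inoperable. */
bool model_operable(const struct model_device_fd *dev)
{
	return dev->joined != NULL;
}
```

In this model a second iommu_fd cannot join a device whose group is already claimed, and the group only becomes claimable again after the last joined device_fd is closed.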

Jason

2021-06-11 19:39:54

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, 11 Jun 2021 13:45:29 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Thu, Jun 10, 2021 at 09:38:42AM -0600, Alex Williamson wrote:
>
> > Opening the group is not the extent of the security check currently
> > required, the group must be added to a container and an IOMMU model
> > configured for the container *before* the user can get a devicefd.
> > Each devicefd creates a reference to this security context, therefore
> > access to a device does not exist without such a context.
>
> Okay, I missed that detail in the organization..
>
> So, if we have an independent vfio device fd then it needs to be
> kept disabled until the user joins it to an ioasid that provides the
> security proof to allow it to work?

Yes, the user would effectively get a dummy fd with no device access
until not only that device, but every device in the IOMMU group is
attached to a secure context. Then we get into questions about whether
devices can be moved between contexts/ioasids within the same ioasidfd
and what that implies to both the device and all other devices within
the group as a device is transitioned and the system is potentially
exposed.

> > What happens on detach? As we've discussed elsewhere in this thread,
> > revoking access is more difficult than holding a reference to the
> > secure context, but I'm under the impression that moving a device
> > between IOASIDs could be standard practice in this new model. A device
> > that's detached from a secure context, even temporarily, is a
> > problem.
>
> This is why I think the single iommu FD is critical, it is the FD, not
> the IOASID that has to authorize the security. You shouldn't move
> devices between FDs, but you can move them between IOASIDs inside the
> same FD.

Right, but that doesn't solve the issue. Removing a device from one
isolated context, even if to move it to another isolated context within
the same ioasidfd exposes the device and has implications for all
devices within the group.

> > How to label a device seems like a relatively mundane issue relative to
> > ownership and isolated contexts of groups and devices. The label is
> > essentially just creating an identifier to device mapping, where the
> > identifier (label) will be used in the IOASID interface, right?
>
> It looks that way
>
> > As I note above, that makes it difficult for vfio to maintain that a
> > user only accesses a device in a secure context. This is exactly
> > why vfio has the model of getting a devicefd from a groupfd only
> > when that group is in a secure context and maintaining references to
> > that secure context for each device. Split ownership of the secure
> > context in IOASID vs device access in vfio and exposing devicefds
> > outside the group is still a big question mark for me. Thanks,
>
> I think the protection model becomes different once we allow
> individual devices inside a group to be attached to different
> IOASID's.
>
> Now we just want some general authorization that the user is allowed
> to operate the device_fd.

That's fine for a serial port, but not a device that can do DMA. The
entire point of vfio is to try to provide secure, DMA capable userspace
drivers. If we relax enforcement of that isolation we've failed.

> To keep a fairly similar model to the way vfio does things today..
>
> - The device_fd is single open, so only one fd exists globally
>
> - Upon first joining the iommu_fd the group is obtained inside
> the iommu_fd. This is only possible if no other iommu_fd has
> obtained the group

vfio_groups have an ownership model, iommu_groups do not.

> - If the group can not be obtained then the device_fd is left
> inoperable and cannot control the device
>
> - If multiple devices in the same group are joined then they all
> refcount the group
>
> It is simple, and gives semantics similar to VFIO with the notable
> difference that a process can obtain a device FD, it is just inoperable
> until the iommu_fd is attached.
>
> Removal is OK as if you remove the device_fd from the iommu_fd (only
> allowed by closing it) then a newly opened FD is inoperable.

I don't see how this provides isolation. If a user only needs to
attach their devicefd to an ioasidfd to have full access to their
device, not even bound by attaching to an ioasid context, then we've
failed. All devices in a group must be bound to a secure context for
the extent of the time that any device in the group is operated by a
user. That seems non-negotiable to me. Thanks,

Alex

2021-06-11 21:40:50

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, 11 Jun 2021 00:58:35 +0000
"Tian, Kevin" <[email protected]> wrote:

> Hi, Alex,
>
> > From: Alex Williamson <[email protected]>
> > Sent: Thursday, June 10, 2021 11:39 PM
> >
> > On Wed, 9 Jun 2021 15:49:40 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Wed, Jun 09, 2021 at 10:27:22AM -0600, Alex Williamson wrote:
> > >
> > > > > > It is a kernel decision, because a fundamental task of the kernel is to
> > > > > > ensure isolation between user-space tasks as good as it can. And if a
> > > > > > > device assigned to one task can interfere with a device of another task
> > > > > > > (e.g. by sending P2P messages), then the promise of isolation is
> > > > > > > broken.
> > > > >
> > > > > AIUI, the IOASID model will still enforce IOMMU groups, but it's not an
> > > > > explicit part of the interface like it is for vfio. For example the
> > > > > IOASID model allows attaching individual devices such that we have
> > > > > granularity to create per device IOASIDs, but all devices within an
> > > > > > IOMMU group are required to be attached to an IOASID before they can
> > > > > > be used.
> > >
> > > Yes, thanks Alex
> > >
> > > > > It's not entirely clear to me yet how that last bit gets
> > > > > implemented though, ie. what barrier is in place to prevent device
> > > > > usage prior to reaching this viable state.
> > >
> > > The major security checkpoint for the group is on the VFIO side. We
> > > must require the group before userspace can be allowed access to any
> > > device registers. Obtaining the device_fd from the group_fd does this
> > > today as the group_fd is the security proof.
> > >
> > > Actually, thinking about this some more.. If the only way to get a
> > > working device_fd in the first place is to get it from the group_fd
> > > and thus pass a group-based security check, why do we need to do
> > > anything at the ioasid level?
> > >
> > > The security concept of isolation was satisfied as soon as userspace
> > > opened the group_fd. What do more checks in the kernel accomplish?
> >
> > Opening the group is not the extent of the security check currently
> > required, the group must be added to a container and an IOMMU model
> > configured for the container *before* the user can get a devicefd.
> > Each devicefd creates a reference to this security context, therefore
> > access to a device does not exist without such a context.
>
> IIUC each device has a default domain when it's probed by iommu driver
> at boot time. This domain includes an empty page table, implying that
> device is already in a security context before it's probed by device driver.

The default domain could be passthrough though, right?

> Now when this device is added to vfio, vfio creates another security
> context through above sequence. This sequence requires the device to
> switch from default security context to this new one, before it can be
> accessed by user.

This is true currently, we use group semantics with the type1 IOMMU
backend to attach all devices in the group to a secure context,
regardless of the default domain.

> Then I wonder whether it's really necessary. As long as a device is in
> a security context at any time, access to a device can be allowed. The
> user itself should ensure that the access happens only after the device
> creates a reference to the new security context that is desired by this
> user.
>
> Then what does group really bring to us?

By definition an IOMMU group is the smallest set of devices that we
can consider isolated from all other devices. Therefore devices in a
group are not necessarily isolated from each other. Therefore if any
device within a group is not isolated, the group is not isolated. VFIO
needs to know when it's safe to provide userspace access to the device,
but the device isolation is dependent on the group isolation. The
group is therefore part of this picture whether implicit or explicit.
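
To make the dependency concrete, here is a minimal sketch of that rule, not actual vfio or iommu code. The struct and both helper names are hypothetical and exist only for illustration: access to any one device is tied to the isolation of every device in its group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device state, not a kernel structure. */
struct sketch_device {
	bool in_secure_context;	/* attached to an isolated IOMMU context */
};

/* A group is isolated only if every device in it is isolated. */
bool sketch_group_isolated(const struct sketch_device *devs, size_t n)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!devs[i].in_secure_context)
			return false;
	}
	return true;
}

/* Access to one device depends on the isolation of its whole group. */
bool sketch_device_access_ok(const struct sketch_device *group, size_t n)
{
	return n > 0 && sketch_group_isolated(group, n);
}
```

With two devices in a group and only one of them in a secure context, sketch_device_access_ok() returns false for both.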

> With this new proposal we just need to make sure that a device cannot
> be attached to any IOASID before all devices in its group are bound to
> the IOASIDfd. If we want to start with a vfio-like policy, then all devices
> in the group must be attached to the same IOASID. Or as Jason suggests,
> they can attach to different IOASIDs (if in the group due to !ACS) if the
> user wants, or have some devices attached while others detached since
> both are in a security context anyway.

But if it's the device attachment to the IOASID that provides the
isolation and the user might attach a device to multiple IOASIDs within
the same IOASIDfd, and presumably make changes to the mapping of device
to IOASID dynamically, are we interrupting user access around each of
those changes? How would vfio be able to track this, and not only
track it per device, but for all devices in the group. Suggesting a
user needs to explicitly attach every device in the group is also a
semantic change versus existing vfio, where other devices in the group
must only be considered to be in a safe state for the group to be
usable.

The default domain may indeed be a solution to the problem, but we need
to enforce a secure default domain for all devices in the group. To me
that suggests that binding the *group* to an IOASIDfd is the point at
which device access becomes secure. VFIO should be able to consider
that the IOASIDfd binding has taken over ownership of the DMA context
for the device and it will always be either an empty, isolated, default
domain or a user defined IOASID.

Maybe the model relative to vfio is something like:

1. bind a group to an IOASIDfd
VFIO_GROUP_BIND_IOASID_FD(groupfd, ioasidfd)
2. create an IOASID label for each device
VFIO_DEVICE_SET_IOASID_LABEL(devicefd, device_ioasid_label)

VFIO can open access to the device after step 1, the IOASIDfd takes
responsibility for the device IOMMU context. After step 2, shouldn't
the user switch to the IOASID uAPI? I don't see why vfio would be
involved in attaching devices to specific IOASID contexts within the
IOASIDfd at that point, we might need internal compatibility
interfaces, but a native IOASID user should have all they need to
attach device labels to IOASIDs using the IOASIDfd at this point.

We'll need to figure out what the release model looks like too. A
group should hold a reference on the IOASIDfd and each open device
should hold a reference on the group so that the isolation of the group
cannot be broken while any device is open.
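
One possible shape for that reference chain, as a sketch only: plain counters stand in for real kernel refcounts, and every name and error code here is hypothetical. The point it shows is that the group cannot drop its IOASIDfd reference while any device is still open.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical objects, for illustration only. */
struct sketch_ioasid_fd {
	int refs;		/* references held by bound groups */
};

struct sketch_grp {
	struct sketch_ioasid_fd *ctx;	/* NULL until bound */
	int open_devices;		/* each open device pins the group */
};

/* Binding a group takes a reference on the IOASIDfd. */
int sketch_grp_bind(struct sketch_grp *g, struct sketch_ioasid_fd *ctx)
{
	if (g->ctx != NULL)
		return -EBUSY;
	ctx->refs++;
	g->ctx = ctx;
	return 0;
}

/* Opening a device takes a reference on its (bound) group. */
int sketch_dev_open(struct sketch_grp *g)
{
	if (g->ctx == NULL)
		return -EPERM;
	g->open_devices++;
	return 0;
}

void sketch_dev_close(struct sketch_grp *g)
{
	assert(g->open_devices > 0);
	g->open_devices--;
}

/* The group may only leave the IOASIDfd once no device is open. */
int sketch_grp_unbind(struct sketch_grp *g)
{
	if (g->ctx == NULL)
		return -EINVAL;
	if (g->open_devices > 0)
		return -EBUSY;
	g->ctx->refs--;
	g->ctx = NULL;
	return 0;
}
```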

> > This proposal has of course put the device before the group, which then
> > makes it more difficult for vfio to retroactively enforce security.
> >
> > > Yes, we have the issue where some groups require all devices to use
> > > the same IOASID, but once someone has the group_fd that is no longer a
> > > security issue. We can fail VFIO_DEVICE_ATTACH_IOASID calls that
> > > don't make sense.
> >
> > The groupfd only proves the user has an ownership claim to the devices,
> > it does not itself prove that the devices are in an isolated context.
> > Device access is not granted until that isolated context is configured.
> >
> > vfio owns the device, so it would make sense for vfio to enforce the
> > security of device access only in a secure context, but how do we know
> > a device is in a secure context?
> >
> > Is it sufficient to track the vfio device ioctls for attach/detach for
> > an IOASID or will the user be able to manipulate IOASID configuration
> > for a device directly via the IOASIDfd?
> >
> > What happens on detach? As we've discussed elsewhere in this thread,
> > revoking access is more difficult than holding a reference to the
> > secure context, but I'm under the impression that moving a device
> > between IOASIDs could be standard practice in this new model. A device
> > that's detached from a secure context, even temporarily, is a problem.
> > Access to other devices in the same group as a device detached from a
> > secure context is a problem.
>
> as long as the device is switched back to the default security context
> after detach then it should be fine.

So long as the default context is secure, and ideally if IOMMU context
switches are atomic.

> > > > > > > Groups should be primarily about isolation security, not about
> > > > > > > IOASID matching.
> > > > > >
> > > > > > That doesn't make any sense, what do you mean by 'IOASID matching'?
> > > > >
> > > > > One of the problems with the vfio interface use of groups is that we
> > > > > conflate the IOMMU group for both isolation and granularity. I think
> > > > > what Jason is referring to here is that we still want groups to be the
> > > > > basis of isolation, but we don't want a uAPI that presumes all devices
> > > > > within the group must use the same IOASID.
> > >
> > > Yes, thanks again Alex
> > >
> > > > > For example, if a user owns an IOMMU group consisting of
> > > > > non-isolated functions of a multi-function device, they should be
> > > > > able to create a vIOMMU VM where each of those functions has its
> > > > > own address space. That can't be done today, the entire group
> > > > > would need to be attached to the VM under a PCIe-to-PCI bridge to
> > > > > reflect the address space limitation imposed by the vfio group
> > > > > uAPI model. Thanks,
> > > >
> > > > Hmm, likely discussed previously in these threads, but I can't come up
> > > > with the argument that prevents us from making the BIND interface
> > > > at the group level but the ATTACH interface at the device level? For
> > > > example:
> > > >
> > > > - VFIO_GROUP_BIND_IOASID_FD
> > > > - VFIO_DEVICE_ATTACH_IOASID
> > > >
> > > > AFAICT that makes the group ownership more explicit but still allows
> > > > the device level IOASID granularity. Logically this is just an
> > > > internal iommu_group_for_each_dev() in the BIND ioctl. Thanks,
> > >
> > > At a high level it sounds OK.
> > >
> > > However I think your above question needs to be answered - what do we
> > > want to enforce on the iommu_fd and why?
> > >
> > > Also, this creates a problem with the device label idea, we still
> > > need to associate each device_fd with a label, so your above sequence
> > > is probably:
> > >
> > > VFIO_GROUP_BIND_IOASID_FD(group fd)
> > > VFIO_BIND_IOASID_FD(device fd 1, device_label)
> > > VFIO_BIND_IOASID_FD(device fd 2, device_label)
> > > VFIO_DEVICE_ATTACH_IOASID(..)
> > >
> > > And then I think we are back to where I had started, we can trigger
> > > whatever VFIO_GROUP_BIND_IOASID_FD does automatically as soon as all
> > > of the devices in the group have been bound.
> >
> > How to label a device seems like a relatively mundane issue relative to
> > ownership and isolated contexts of groups and devices. The label is
> > essentially just creating an identifier to device mapping, where the
> > identifier (label) will be used in the IOASID interface, right? As I
>
> Three usages in v2:
>
> 1) when reporting per-device capability/format info to user;
> 2) when handling device-wide iotlb invalidation from user;
> 3) when reporting device-specific fault data to user;

As above, it seems more complete to me to move attach/detach of devices
to IOASIDs using the labels as well. Thanks,

Alex

2021-06-12 01:30:38

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 11, 2021 at 01:38:28PM -0600, Alex Williamson wrote:

> That's fine for a serial port, but not a device that can do DMA.
> The entire point of vfio is to try to provide secure, DMA capable
> userspace drivers. If we relax enforcement of that isolation we've
> failed.

I don't understand why the IOASID matters at all in this. Can you
explain? What is the breach of isolation?

A userspace process can create many IOASIDs, it can create an IOASID
that can touch any memory the process can touch. It can create two
IOASID's that are identical copies of each other.

How does restricting a device from attaching to an IOASID create
security if I can just make a copy of that IOASID and attach to that?
Is there some quirk of the IOMMU I'm missing?

My understanding of isolation has been that two different security
contexts cannot have access to devices in the same group because that
can leak access across a security boundary, e.g. because device A can do
DMA to device B and take control of it.

Isolation means that the control of the devices in a group is not
inadvertently spread between two security contexts, like two
processes.

> I don't see how this provides isolation. If a user only needs to
> attach their devicefd to an ioasidfd to have full access to their
> device, not even bound by attaching to an ioasid context, then we've
> failed.

That is not quite what I tried to explain. The first ioasid any device
in the group attaches to becomes the only ioasid that any device in
the group attaches to. It is an ownership model unique to the
iommu_fd.

It directly prevents process A and process B from opening devices in
the same group and trying to operate them independently. A and B will
not possess the same iommu fd so only one of them can activate a
device. The other device remains unusable.
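
A minimal sketch of this ownership rule, assuming the owner is tracked per group as the iommu_fd that first claimed it. The struct, the function name and the -EBUSY choice are all hypothetical and not taken from any existing kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical group state, for illustration only. */
struct sketch_iommu_group {
	int owner_fd;		/* iommu_fd that owns the group, -1 if none */
};

/*
 * Join a device's group to an iommu_fd. The first iommu_fd to claim the
 * group becomes its only owner; any other iommu_fd is refused.
 */
int sketch_group_join(struct sketch_iommu_group *g, int iommu_fd)
{
	assert(g != NULL);
	if (iommu_fd < 0)
		return -EINVAL;
	if (g->owner_fd < 0) {
		g->owner_fd = iommu_fd;	/* first claim wins */
		return 0;
	}
	return g->owner_fd == iommu_fd ? 0 : -EBUSY;
}
```

If process A joins with its iommu_fd first, a later join from process B with a different iommu_fd fails, while further joins from A for other devices in the same group still succeed.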

In this model, I consider the iommu_fd to be the security domain. The
meaning of isolation is that only devices explicitly joined to an
iommu_fd can access IOASIDs in that iommu_fd.

Userspace has choices in how to use this:

By placing all devices in the same iommu_fd, userspace is telling the
kernel that they are in the same security domain. This means userspace
says it is safe for them to all share IOASIDs without isolation.

If userspace wants tighter security domains then userspace can create
additional iommu_fds, up to a unique iommu_fd per group. This would
duplicate the security model that the vfio groups force today.

The kernel security feature is to prevent un-isolated devices from
being joined to different iommu_fd security contexts.

Jason

2021-06-12 17:02:54

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, 11 Jun 2021 22:28:46 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Fri, Jun 11, 2021 at 01:38:28PM -0600, Alex Williamson wrote:
>
> > That's fine for a serial port, but not a device that can do DMA.
> > The entire point of vfio is to try to provide secure, DMA capable
> > userspace drivers. If we relax enforcement of that isolation we've
> > failed.
>
> I don't understand why the IOASID matters at all in this. Can you
> explain? What is the breach of isolation?

I think we're arguing past each other again. VFIO does not care one
iota how userspace configures IOASID domains for devices. OTOH, VFIO
must be absolutely obsessed that the devices we're providing userspace
access to are isolated and continue to be isolated for the extent of
that access. Given that we define that a group is the smallest set of
devices that can be isolated, that means that for a device to be
isolated, the group needs to be isolated.

VFIO currently has a contract with the IOMMU backend that a group is
attached to an IOMMU context (container) and from that point forward,
all devices within that group are known to be isolated.

I'm trying to figure out how a device based interface to the IOASID can
provide that same contract or whether VFIO needs to be able to monitor
the IOASID attachments of the devices in a group to control whether
device access is secure.

As I outlined to Kevin, I think it makes a lot of sense to maintain a
group interface to the IOASID where registering a group signifies a
hand-off of responsibility to the IOASIDfd that it is responsible for
the isolation of those devices. From there we can determine the value
of exposing VFIO device fds directly and whether any of the VFIO
interfaces for attaching devices to IOASIDs make sense versus switching
to the IOASIDfd at that point.

Otherwise, for a device centric VFIO/IOASID model, I need to understand
exactly when and how VFIO can know that it's safe to provide access to
a device and how the IOASID model guarantees the ongoing safety of that
access, which must encompass the safety relative to the entire group.

For example, is it VFIO's job to BIND every device in the group? Does
binding the device represent the point at which the IOASID takes
responsibility for the isolation of the device? If instead it's the
ATTACH of a device that provides the isolation, how is VFIO supposed to
handle device access across a group when one device is DETACH'd by the
user? If ATTACH is the point where isolation is guaranteed, can a
DETACH occur through the IOASIDfd rather than the VFIOfd? It seems
like the IOASIDfd is going to need ways to manipulate device:IOASID
mappings outside of VFIO, so again I wonder if we should switch to an
IOASID uAPI at that point rather than using VFIO. Thanks,

Alex

2021-06-14 03:11:34

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Alex Williamson <[email protected]>
> Sent: Saturday, June 12, 2021 5:39 AM
>
> On Fri, 11 Jun 2021 00:58:35 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > Hi, Alex,
> >
> > > From: Alex Williamson <[email protected]>
> > > Sent: Thursday, June 10, 2021 11:39 PM
> > >
> > > On Wed, 9 Jun 2021 15:49:40 -0300
> > > Jason Gunthorpe <[email protected]> wrote:
> > >
> > > > On Wed, Jun 09, 2021 at 10:27:22AM -0600, Alex Williamson wrote:
> > > >
> > > > > > > It is a kernel decision, because a fundamental task of the kernel is
> > > > > > > to ensure isolation between user-space tasks as good as it can. And if
> > > > > > > a device assigned to one task can interfere with a device of another
> > > > > > > task (e.g. by sending P2P messages), then the promise of isolation is
> > > > > > > broken.
> > > > > >
> > > > > > AIUI, the IOASID model will still enforce IOMMU groups, but it's not
> > > > > > an explicit part of the interface like it is for vfio. For example the
> > > > > > IOASID model allows attaching individual devices such that we have
> > > > > > granularity to create per device IOASIDs, but all devices within an
> > > > > > IOMMU group are required to be attached to an IOASID before they
> > > > > > can be used.
> > > >
> > > > Yes, thanks Alex
> > > >
> > > > > > It's not entirely clear to me yet how that last bit gets
> > > > > > implemented though, ie. what barrier is in place to prevent device
> > > > > > usage prior to reaching this viable state.
> > > >
> > > > The major security checkpoint for the group is on the VFIO side. We
> > > > must require the group before userspace can be allowed access to any
> > > > device registers. Obtaining the device_fd from the group_fd does this
> > > > today as the group_fd is the security proof.
> > > >
> > > > Actually, thinking about this some more.. If the only way to get a
> > > > working device_fd in the first place is to get it from the group_fd
> > > > and thus pass a group-based security check, why do we need to do
> > > > anything at the ioasid level?
> > > >
> > > > The security concept of isolation was satisfied as soon as userspace
> > > > opened the group_fd. What do more checks in the kernel accomplish?
> > >
> > > Opening the group is not the extent of the security check currently
> > > required, the group must be added to a container and an IOMMU model
> > > configured for the container *before* the user can get a devicefd.
> > > Each devicefd creates a reference to this security context, therefore
> > > access to a device does not exist without such a context.
> >
> > IIUC each device has a default domain when it's probed by iommu driver
> > at boot time. This domain includes an empty page table, implying that
> > device is already in a security context before it's probed by device driver.
>
> The default domain could be passthrough though, right?

Good point.

>
> > Now when this device is added to vfio, vfio creates another security
> > context through above sequence. This sequence requires the device to
> > switch from default security context to this new one, before it can be
> > accessed by user.
>
> This is true currently, we use group semantics with the type1 IOMMU
> backend to attach all devices in the group to a secure context,
> regardless of the default domain.
>
> > Then I wonder whether it's really necessary. As long as a device is in
> > a security context at any time, access to a device can be allowed. The
> > user itself should ensure that the access happens only after the device
> > creates a reference to the new security context that is desired by this
> > user.
> >
> > Then what does group really bring to us?
>
> By definition an IOMMU group is the smallest set of devices that we
> can consider isolated from all other devices. Therefore devices in a
> group are not necessarily isolated from each other. Therefore if any
> device within a group is not isolated, the group is not isolated. VFIO
> needs to know when it's safe to provide userspace access to the device,
> but the device isolation is dependent on the group isolation. The
> group is therefore part of this picture whether implicit or explicit.
>
> > With this new proposal we just need to make sure that a device cannot
> > be attached to any IOASID before all devices in its group are bound to
> > the IOASIDfd. If we want to start with a vfio-like policy, then all devices
> > in the group must be attached to the same IOASID. Or as Jason suggests,
> > they can attach to different IOASIDs (if in the group due to !ACS) if the
> > user wants, or have some devices attached while others detached since
> > both are in a security context anyway.
>
> But if it's the device attachment to the IOASID that provides the
> isolation and the user might attach a device to multiple IOASIDs within
> the same IOASIDfd, and presumably make changes to the mapping of device
> to IOASID dynamically, are we interrupting user access around each of
> those changes? How would vfio be able to track this, and not only
> track it per device, but for all devices in the group. Suggesting a
> user needs to explicitly attach every device in the group is also a
> semantic change versus existing vfio, where other devices in the group
> must only be considered to be in a safe state for the group to be
> usable.
>
> The default domain may indeed be a solution to the problem, but we need
> to enforce a secure default domain for all devices in the group. To me
> that suggests that binding the *group* to an IOASIDfd is the point at
> which device access becomes secure. VFIO should be able to consider
> that the IOASIDfd binding has taken over ownership of the DMA context
> for the device and it will always be either an empty, isolated, default
> domain or a user defined IOASID.

Yes, this is one way of enforcing the group security.

In the meantime, I'm wondering whether group security could instead
be enforced in the iommu layer, which would relax the uAPI design.
If a device can always be blocked from accessing memory in the
IOMMU before it's bound to a driver, or more specifically before
the driver moves it to a new security context, then there is no need
for VFIO to track whether the IOASIDfd has taken over ownership of
the DMA context for all devices within a group.

But as you said, this cannot be achieved via the existing default
domain approach. So far a device is always attached to a domain:

- DOMAIN_IDENTITY: a default domain without DMA protection
- DOMAIN_DMA: a default domain with DMA protection via DMA
API and iommu core
- DOMAIN_UNMANAGED: a driver-created domain which is not
managed by iommu core.

The special sequence in the current vfio group design is to mitigate
the 1st case, i.e. if a device is left in passthrough mode before
being bound to VFIO, it's definitely insecure to allow the user to
access it. The sequence then ensures that user access is granted
only after all devices within a group switch to a security context.

Now if the newly proposed scheme can be supported, a device
is always in a security context (block-dma) before it's switched
to a new security context, and existing domain types should be
applied only in the new context when the device starts to do
DMAs. For the VFIO case this switch happens explicitly when attaching
the device to an IOASID. For a kernel driver it's implicit, e.g. it
could happen when the 1st DMA API call is received.

If this works I don't see the need for vfio to keep the sequence.
VFIO still keeps the group fd to claim ownership of all devices in a
group. Once that's done, vfio doesn't need to track the device attach
status, and user access can always be granted regardless of
how the attach status changes. Moving a device from IOASID1
to IOASID2 involves detaching from IOASID1 (back to the blocked
dma context) and then reattaching to IOASID2 (switch to a
new security context).

Following this direction, even the IOASIDfd doesn't need to verify
the group attach, given such a guarantee from the iommu layer.
The devices within a group can be in different security contexts,
e.g. with some devices attached to the GPA IOASID while others are
not attached. In this way vfio userspace could choose not to attach
every device of a group, which sustains the current semantics.

>
> Maybe the model relative to vfio is something like:
>
> 1. bind a group to an IOASIDfd
> VFIO_GROUP_BIND_IOASID_FD(groupfd, ioasidfd)
> 2. create an IOASID label for each device
> VFIO_DEVICE_SET_IOASID_LABEL(devicefd, device_ioasid_label)
>
> VFIO can open access to the device after step 1, the IOASIDfd takes
> responsibility for the device IOMMU context. After step 2, shouldn't
> the user switch to the IOASID uAPI? I don't see why vfio would be
> involved in attaching devices to specific IOASID contexts within the
> IOASIDfd at that point, we might need internal compatibility
> interfaces, but a native IOASID user should have all they need to
> attach device labels to IOASIDs using the IOASIDfd at this point.

In this proposal the VFIO device driver is also responsible for PASID
virtualization since it's a per-device policy that only VFIO knows.
VFIO needs to provide the PASID as routing information when doing
device bind. This is one open item which wasn't thoroughly
discussed in v1, and I'll add more clarification on this part in v2.

>
> We'll need to figure out what the release model looks like too. A
> group should hold a reference on the IOASIDfd and each open device
> should hold a reference on the group so that the isolation of the group
> cannot be broken while any device is open.
>
> > > This proposal has of course put the device before the group, which then
> > > makes it more difficult for vfio to retroactively enforce security.
> > >
> > > > Yes, we have the issue where some groups require all devices to use
> > > > the same IOASID, but once someone has the group_fd that is no longer
> > > > a security issue. We can fail VFIO_DEVICE_ATTACH_IOASID calls that
> > > > don't make sense.
> > >
> > > The groupfd only proves the user has an ownership claim to the devices,
> > > it does not itself prove that the devices are in an isolated context.
> > > Device access is not granted until that isolated context is configured.
> > >
> > > vfio owns the device, so it would make sense for vfio to enforce the
> > > security of device access only in a secure context, but how do we know
> > > a device is in a secure context?
> > >
> > > Is it sufficient to track the vfio device ioctls for attach/detach for
> > > an IOASID or will the user be able to manipulate IOASID configuration
> > > for a device directly via the IOASIDfd?
> > >
> > > What happens on detach? As we've discussed elsewhere in this thread,
> > > revoking access is more difficult than holding a reference to the
> > > secure context, but I'm under the impression that moving a device
> > > between IOASIDs could be standard practice in this new model. A device
> > > that's detached from a secure context, even temporarily, is a problem.
> > > Access to other devices in the same group as a device detached from a
> > > secure context is a problem.
> >
> > as long as the device is switched back to the default security context
> > after detach then it should be fine.
>
> So long as the default context is secure, and ideally if IOMMU context
> switches are atomic.

As long as every switch is to/from a block-dma context, it should work.

>
> > > > > > > > Groups should be primarily about isolation security, not about
> > > > > > > > IOASID matching.
> > > > > > >
> > > > > > > That doesn't make any sense, what do you mean by 'IOASID
> > > > > > > matching'?
> > > > > >
> > > > > > One of the problems with the vfio interface use of groups is that we
> > > > > > conflate the IOMMU group for both isolation and granularity. I
> > > > > > think what Jason is referring to here is that we still want groups
> > > > > > to be the basis of isolation, but we don't want a uAPI that presumes
> > > > > > all devices within the group must use the same IOASID.
> > > >
> > > > Yes, thanks again Alex
> > > >
> > > > > > For example, if a user owns an IOMMU group consisting of
> > > > > > non-isolated functions of a multi-function device, they should be
> > > > > > able to create a vIOMMU VM where each of those functions has its
> > > > > > own address space. That can't be done today, the entire group
> > > > > > would need to be attached to the VM under a PCIe-to-PCI bridge to
> > > > > > reflect the address space limitation imposed by the vfio group
> > > > > > uAPI model. Thanks,
> > > > >
> > > > > Hmm, likely discussed previously in these threads, but I can't come up
> > > > > with the argument that prevents us from making the BIND interface
> > > > > at the group level but the ATTACH interface at the device level? For
> > > > > example:
> > > > >
> > > > > - VFIO_GROUP_BIND_IOASID_FD
> > > > > - VFIO_DEVICE_ATTACH_IOASID
> > > > >
> > > > > AFAICT that makes the group ownership more explicit but still allows
> > > > > the device level IOASID granularity. Logically this is just an
> > > > > internal iommu_group_for_each_dev() in the BIND ioctl. Thanks,
> > > >
> > > > At a high level it sounds OK.
> > > >
> > > > However I think your above question needs to be answered - what do
> > > > we want to enforce on the iommu_fd and why?
> > > >
> > > > Also, this creates a problem with the device label idea, we still
> > > > need to associate each device_fd with a label, so your above sequence
> > > > is probably:
> > > >
> > > > VFIO_GROUP_BIND_IOASID_FD(group fd)
> > > > VFIO_BIND_IOASID_FD(device fd 1, device_label)
> > > > VFIO_BIND_IOASID_FD(device fd 2, device_label)
> > > > VFIO_DEVICE_ATTACH_IOASID(..)
> > > >
> > > > And then I think we are back to where I had started, we can trigger
> > > > whatever VFIO_GROUP_BIND_IOASID_FD does automatically as soon as
> > > > all of the devices in the group have been bound.
> > >
> > > How to label a device seems like a relatively mundane issue relative to
> > > ownership and isolated contexts of groups and devices. The label is
> > > essentially just creating an identifier to device mapping, where the
> > > identifier (label) will be used in the IOASID interface, right? As I
> >
> > Three usages in v2:
> >
> > 1) when reporting per-device capability/format info to user;
> > 2) when handling device-wide iotlb invalidation from user;
> > 3) when reporting device-specific fault data to user;
>
> As above, it seems more complete to me to move attach/detach of devices
> to IOASIDs using the labels as well. Thanks,
>

Thanks
Kevin

2021-06-14 03:29:35

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, 14 Jun 2021 03:09:31 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Saturday, June 12, 2021 5:39 AM
> >
> > On Fri, 11 Jun 2021 00:58:35 +0000
> > "Tian, Kevin" <[email protected]> wrote:
> >
> > > Hi, Alex,
> > >
> > > > From: Alex Williamson <[email protected]>
> > > > Sent: Thursday, June 10, 2021 11:39 PM
> > > >
> > > > On Wed, 9 Jun 2021 15:49:40 -0300
> > > > Jason Gunthorpe <[email protected]> wrote:
> > > >
> > > > > On Wed, Jun 09, 2021 at 10:27:22AM -0600, Alex Williamson wrote:
> > > > >
> > > > > > > > It is a kernel decision, because a fundamental task of the kernel is
> > > > > > > > to ensure isolation between user-space tasks as good as it can. And if
> > > > > > > > a device assigned to one task can interfere with a device of another
> > > > > > > > task (e.g. by sending P2P messages), then the promise of isolation is
> > > > > > > > broken.
> > > > > > >
> > > > > > > AIUI, the IOASID model will still enforce IOMMU groups, but it's not
> > > > > > > an explicit part of the interface like it is for vfio. For example the
> > > > > > > IOASID model allows attaching individual devices such that we have
> > > > > > > granularity to create per device IOASIDs, but all devices within an
> > > > > > > IOMMU group are required to be attached to an IOASID before they
> > > > > > > can be used.
> > > > >
> > > > > Yes, thanks Alex
> > > > >
> > > > > > > It's not entirely clear to me yet how that last bit gets
> > > > > > > implemented though, ie. what barrier is in place to prevent device
> > > > > > > usage prior to reaching this viable state.
> > > > >
> > > > > The major security checkpoint for the group is on the VFIO side. We
> > > > > must require the group before userspace can be allowed access to any
> > > > > device registers. Obtaining the device_fd from the group_fd does this
> > > > > today as the group_fd is the security proof.
> > > > >
> > > > > Actually, thinking about this some more.. If the only way to get a
> > > > > working device_fd in the first place is to get it from the group_fd
> > > > > and thus pass a group-based security check, why do we need to do
> > > > > anything at the ioasid level?
> > > > >
> > > > > The security concept of isolation was satisfied as soon as userspace
> > > > > opened the group_fd. What do more checks in the kernel accomplish?
> > > >
> > > > Opening the group is not the extent of the security check currently
> > > > required, the group must be added to a container and an IOMMU model
> > > > configured for the container *before* the user can get a devicefd.
> > > > Each devicefd creates a reference to this security context, therefore
> > > > access to a device does not exist without such a context.
> > >
> > > IIUC each device has a default domain when it's probed by iommu driver
> > > at boot time. This domain includes an empty page table, implying that
> > > device is already in a security context before it's probed by device driver.
> >
> > The default domain could be passthrough though, right?
>
> Good point.
>
> >
> > > Now when this device is added to vfio, vfio creates another security
> > > context through above sequence. This sequence requires the device to
> > > switch from default security context to this new one, before it can be
> > > accessed by user.
> >
> > This is true currently, we use group semantics with the type1 IOMMU
> > backend to attach all devices in the group to a secure context,
> > regardless of the default domain.
> >
> > > Then I wonder whether it's really necessary. As long as a device is in
> > > a security context at any time, access to a device can be allowed. The
> > > user itself should ensure that the access happens only after the device
> > > creates a reference to the new security context that is desired by this
> > > user.
> > >
> > > Then what does group really bring to us?
> >
> > By definition an IOMMU group is the smallest set of devices that we
> > can consider isolated from all other devices. Therefore devices in a
> > group are not necessarily isolated from each other. Therefore if any
> > device within a group is not isolated, the group is not isolated. VFIO
> > needs to know when it's safe to provide userspace access to the device,
> > but the device isolation is dependent on the group isolation. The
> > group is therefore part of this picture whether implicit or explicit.
> >
> > > With this new proposal we just need to make sure that a device cannot
> > > be attached to any IOASID before all devices in its group are bound to
> > > the IOASIDfd. If we want to start with a vfio-like policy, then all devices
> > > in the group must be attached to the same IOASID. Or as Jason suggests,
> > > they can attach to different IOASIDs (if in the group due to !ACS) if the
> > > user wants, or have some devices attached while others detached since
> > > both are in a security context anyway.
> >
> > But if it's the device attachment to the IOASID that provides the
> > isolation and the user might attach a device to multiple IOASIDs within
> > the same IOASIDfd, and presumably make changes to the mapping of device
> > to IOASID dynamically, are we interrupting user access around each of
> > those changes? How would vfio be able to track this, and not only
> > track it per device, but for all devices in the group. Suggesting a
> > user needs to explicitly attach every device in the group is also a
> > semantic change versus existing vfio, where other devices in the group
> > must only be considered to be in a safe state for the group to be
> > usable.
> >
> > The default domain may indeed be a solution to the problem, but we need
> > to enforce a secure default domain for all devices in the group. To me
> > that suggests that binding the *group* to an IOASIDfd is the point at
> > which device access becomes secure. VFIO should be able to consider
> > that the IOASIDfd binding has taken over ownership of the DMA context
> > for the device and it will always be either an empty, isolated, default
> > domain or a user defined IOASID.
>
> Yes, this is one way of enforcing the group security.
>
> In the meantime, I'm thinking about another way whether group
> security can be enforced in the iommu layer to relax the uAPI design.
> If a device can be always blocked from accessing memory in the
> IOMMU before it's bound to a driver or more specifically before
> the driver moves it to a new security context, then there is no need
> for VFIO to track whether IOASIDfd has taken over ownership of
> the DMA context for all devices within a group.

But we know we don't have IOMMU level isolation between devices in the
same group, so I don't see how this helps us.

> But as you said this cannot be achieved via existing default domain
> approach. So far a device is always attached to a domain:
>
> - DOMAIN_IDENTITY: a default domain without DMA protection
> - DOMAIN_DMA: a default domain with DMA protection via DMA
> API and iommu core
> - DOMAIN_UNMANAGED: a driver-created domain which is not
> managed by iommu core.
>
> The special sequence in current vfio group design is to mitigate
> the 1st case, i.e. if a device is left in passthrough mode before
> bound to VFIO it's definitely insecure to allow user to access it.
> Then the sequence ensures that the user access is granted on it
> only after all devices within a group switch to a security context.
>
> Now if the new proposed scheme can be supported, a device
> is always in a security context (block-dma) before it's switched
> to a new security context and existing domain types should be
> applied only in the new context when the device starts to do
> DMAs. For VFIO case this switch happens explicitly when attaching
> the device to an IOASID. For kernel driver it's implicit e.g. could
> happen when the 1st DMA API call is received.
>
> If this works I didn't see the need for vfio to keep the sequence.
> VFIO still keeps group fd to claim ownership of all devices in a
> group. Once it's done, vfio doesn't need to track the device attach
> status and user access can be always granted regardless of
> how the attach status changes. Moving a device from IOASID1
> to IOASID2 involves detaching from IOASID1 (back to blocked
> dma context) and then reattaching to IOASID2 (switch to a
> new security context).
>
> Following this direction even IOASIDfd doesn't need to verify
> the group attach upon such guarantee from the iommu layer.
> The devices within a group can be in different security contexts,
> e.g. with some devices attached to GPA IOASID while others not
> attached. In this way vfio userspace could choose to not attach
> every device of a group to sustain the current semantics.

It seems like this entirely misses the point of groups with multiple
devices. If we had IOMMU level isolation between all devices, we'd
never have multi-device groups. Thanks,

Alex

2021-06-14 13:41:31

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, Jun 14, 2021 at 03:09:31AM +0000, Tian, Kevin wrote:

> If a device can be always blocked from accessing memory in the IOMMU
> before it's bound to a driver or more specifically before the driver
> moves it to a new security context, then there is no need for VFIO
> to track whether IOASIDfd has taken over ownership of the DMA
> context for all devices within a group.

I've been assuming we'd do something like this, where when a device is
first turned into a VFIO it tells the IOMMU layer that this device
should be DMA blocked unless an IOASID is attached to
it. Disconnecting an IOASID returns it to blocked.
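
A rough way to picture this, as a sketch only: the model below is hypothetical user-space C, not kernel code, and every `sketch_` name is invented for illustration. It assumes "no IOASID attached" is the blocked state, so detaching is the same as returning to blocked DMA.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code. */
struct sketch_ioasid {
    int id;
};

struct sketch_device {
    bool claimed_by_vfio;
    const struct sketch_ioasid *ioasid; /* NULL means DMA is blocked */
};

/* Claiming the device for VFIO immediately blocks all DMA. */
static void sketch_vfio_claim(struct sketch_device *dev)
{
    dev->claimed_by_vfio = true;
    dev->ioasid = NULL;
}

/* Attaching is only allowed on a claimed device. Returns 0 on success. */
static int sketch_attach(struct sketch_device *dev,
                         const struct sketch_ioasid *ioasid)
{
    if (!dev->claimed_by_vfio || ioasid == NULL)
        return -1;
    dev->ioasid = ioasid;
    return 0;
}

/* Detaching returns the device to the blocked state. */
static void sketch_detach(struct sketch_device *dev)
{
    dev->ioasid = NULL;
}

static bool sketch_dma_blocked(const struct sketch_device *dev)
{
    return dev->claimed_by_vfio && dev->ioasid == NULL;
}
```

In this model a device can never be under VFIO control and attached to something DMA capable unless an explicit attach happened first.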

> If this works I didn't see the need for vfio to keep the sequence.
> VFIO still keeps group fd to claim ownership of all devices in a
> group.

As Alex says you still have to deal with the problem that device A in
a group can gain control of device B in the same group.

This means devices A and B cannot be used from two different
security contexts.

If the /dev/iommu FD is the security context then the tracking is
needed there.

Jason

2021-06-14 14:08:49

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Sat, Jun 12, 2021 at 10:57:11AM -0600, Alex Williamson wrote:
> On Fri, 11 Jun 2021 22:28:46 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Fri, Jun 11, 2021 at 01:38:28PM -0600, Alex Williamson wrote:
> >
> > > That's fine for a serial port, but not a device that can do DMA.
> > > The entire point of vfio is to try to provide secure, DMA capable
> > > userspace drivers. If we relax enforcement of that isolation we've
> > > failed.
> >
> > I don't understand why the IOASID matters at all in this. Can you
> > explain? What is the breach of isolation?
>
> I think we're arguing past each other again. VFIO does not care one
> iota how userspace configures IOASID domains for devices. OTOH, VFIO
> must be absolutely obsessed that the devices we're providing userspace
> access to are isolated and continue to be isolated for the extent of
> that access. Given that we define that a group is the smallest set of
> devices that can be isolated, that means that for a device to be
> isolated, the group needs to be isolated.
>
> VFIO currently has a contract with the IOMMU backend that a group is
> attached to an IOMMU context (container) and from that point forward,
> all devices within that group are known to be isolated.

Sure - and maybe this is the source of the confusion as I've been
assuming we'd change the kernel to match what we are doing. As in the
other note a device under VFIO control should immediately have its
IOMMU programmed to block all DMA. This is basically attaching it to a
dummy ioasid with an empty page table.

So before VFIO exposes any char device all devices/groups under VFIO
control cannot do any DMA. The only security/isolation harmful action
they can do is DMA to devices in the same group.

> I'm trying to figure out how a device based interface to the IOASID can
> provide that same contract or whether VFIO needs to be able to monitor
> the IOASID attachments of the devices in a group to control whether
> device access is secure.

Can you define what, specifically, secure and isolation mean?

To my mind it is these three things:

1. The device can only do DMA to memory put into its security context
2. No other security context can control this device
3. No other security context can do DMA to my userspace memory

Today in VFIO the security context is the group fd. I would like the
security context to be the iommu fd.

1 is achieved by ensuring the device is always connected to an
IOASID. Today the group fd requires an IOASID before it hands out a
device_fd. With iommu_fd the device_fd will not allow IOCTLs until it
has a blocked DMA IOASID and is successfully joined to an iommu_fd.

2 is achieved by ensuring that two security contexts can't open
devices in the same group. Today the group fd deals with this by being
single open. With iommu_fd the kernel would not permit splitting
groups between iommu_fds.

3 is achieved today by the group_fd enforcing a single IOASID on all
devices. Under iommu_fd all devices in the group can use any IOASID in
their iommu_fd security domain.

It is a slightly different model than VFIO uses, but I don't think it
provides less isolation.

> Otherwise, for a device centric VFIO/IOASID model, I need to understand
> exactly when and how VFIO can know that it's safe to provide access to
> a device and how the IOASID model guarantees the ongoing safety of that
> access, which must encompass the safety relative to the entire group.

Let's agree on what safety means, then we can evaluate it.

> For example, is it VFIO's job to BIND every device in the group?

I'm thinking no

> Does binding the device represent the point at which the IOASID
> takes responsibility for the isolation of the device?

Following Kevin's language BIND is when the device_fd and iommu_fd are
connected. That is when I see the device as becoming usable. Whatever
security/isolation requirements we decide on should be met here.

> If instead it's the ATTACH of a device that provides the isolation,
> how is VFIO supposed to

Not the attach

> DETACH occur through the IOASIDfd rather than the VFIOfd? It seems
> like the IOASIDfd is going to need ways to manipulate device:IOASID
> mappings outside of VFIO, so again I wonder if we should switch to an
> IOASID uAPI at that point rather than using VFIO. Thanks,

I don't think so... When the VFIO device_fd is closed it should
disconnect the iommu from its device, restore the blocked DMA
configuration, and then remove itself from the iommu_fd.

Once the device is back to blocked DMA there is no further need for
the iommu_fd to touch it.

Jason

2021-06-14 16:31:11

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, 14 Jun 2021 11:07:11 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Sat, Jun 12, 2021 at 10:57:11AM -0600, Alex Williamson wrote:
> > On Fri, 11 Jun 2021 22:28:46 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Fri, Jun 11, 2021 at 01:38:28PM -0600, Alex Williamson wrote:
> > >
> > > > That's fine for a serial port, but not a device that can do DMA.
> > > > The entire point of vfio is to try to provide secure, DMA capable
> > > > userspace drivers. If we relax enforcement of that isolation we've
> > > > failed.
> > >
> > > I don't understand why the IOASID matters at all in this. Can you
> > > explain? What is the breach of isolation?
> >
> > I think we're arguing past each other again. VFIO does not care one
> > iota how userspace configures IOASID domains for devices. OTOH, VFIO
> > must be absolutely obsessed that the devices we're providing userspace
> > access to are isolated and continue to be isolated for the extent of
> > that access. Given that we define that a group is the smallest set of
> > devices that can be isolated, that means that for a device to be
> > isolated, the group needs to be isolated.
> >
> > VFIO currently has a contract with the IOMMU backend that a group is
> > attached to an IOMMU context (container) and from that point forward,
> > all devices within that group are known to be isolated.
>
> Sure - and maybe this is the source of the confusion as I've been
> assuming we'd change the kernel to match what we are doing. As in the
> other note a device under VFIO control should immediately have its
> IOMMU programmed to block all DMA. This is basically attaching it to a
> dummy ioasid with an empty page table.
>
> So before VFIO exposes any char device all devices/groups under VFIO
> control cannot do any DMA. The only security/isolation harmful action
> they can do is DMA to devices in the same group.
>
> > I'm trying to figure out how a device based interface to the IOASID can
> > provide that same contract or whether VFIO needs to be able to monitor
> > the IOASID attachments of the devices in a group to control whether
> > device access is secure.
>
> Can you define what, specifically, secure and isolation mean?
>
> To my mind it is these three things:
>
> 1. The device can only do DMA to memory put into its security context

System memory or device memory, yes.

Corollary: The IOMMU group defines the minimum set of devices where the
IOMMU can control inter-device DMA.

> 2. No other security context can control this device
> 3. No other security context can do DMA to my userspace memory

Rule #1 is essentially the golden rule, the rest falls out from it.

> Today in VFIO the security context is the group fd. I would like the
> security context to be the iommu fd.

The vfio group is simply a representation of the IOMMU group, which is
the minimum isolation granularity. The group is therefore the minimum
security context, but itself is not a security context. The overall
security context for vfio is the set of containers (IOMMU contexts)
owned by a user, where each container defines the IOMMU context for a
set of groups. The user can map process and device memory between
containers within the same security context.

As you know, we have various issues with invalidation of device
mappings between containers, so simplifying the security context to the
ioasidfd seems like a good plan. The vfio notion of a container is
already encompassed in the IOASID of the ioasidfd.

The significant difference is therefore the device level IOASID versus
vfio's group level container granularity. This means the IOASID model
needs to incorporate the group model not only in terms of isolation,
but also address-ability. The vfio model allows these to be combined
as a significant simplification.

> 1 is achieved by ensuring the device is always connected to an

s/device/group/

As you note in reply to Kevin, in a multi-device group rule #1 can be
violated if only one device is connected to an IOASID.

> IOASID. Today the group fd requires an IOASID before it hands out a
> device_fd. With iommu_fd the device_fd will not allow IOCTLs until it
> has a blocked DMA IOASID and is successfully joined to an iommu_fd.

Which is the root of my concern. Who owns ioctls to the device fd?
It's my understanding this is a vfio provided file descriptor and it's
therefore vfio's responsibility. A device-level IOASID interface
therefore requires that vfio manage the group aspect of device access.
AFAICT, that means that device access can therefore only begin when all
devices for a given group are attached to the IOASID and must halt for
all devices in the group if any device is ever detached from an IOASID,
even temporarily. That suggests a lot more oversight of the IOASIDs by
vfio than I'd prefer.
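
To make that concern concrete, here is a minimal sketch of the strict policy described above, under the assumption that vfio itself has to gate access. The types and the `sketch_` names are hypothetical and do not correspond to any existing vfio structure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in an IOMMU group. */
struct sketch_device {
    bool attached; /* true while the device is attached to some IOASID */
};

/*
 * Strict policy: access to any device in the group is allowed only
 * while every device in the group is attached. An empty group grants
 * nothing.
 */
static bool sketch_group_access_allowed(const struct sketch_device *devs,
                                        size_t count)
{
    if (count == 0)
        return false;
    for (size_t i = 0; i < count; i++) {
        if (!devs[i].attached)
            return false;
    }
    return true;
}
```

Under this policy a temporary detach of one device revokes access for the whole group, which is the oversight burden in question.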

> 2 is achieved by ensuring that two security contexts can't open
> devices in the same group. Today the group fd deals with this by being
> single open. With iommu_fd the kernel would not permit splitting
> groups between iommu_fds.

"Who" within the kernel? Is it the IOASID code itself or is this
another responsibility of vfio? If IOASID knows about groups for this,
it's not clear to me why we have a device-level bind interface. A
group-level bind interface clearly makes this more explicit.

> 3 is achieved today by the group_fd enforcing a single IOASID on all
> devices. Under iommu_fd all devices in the group can use any IOASID in
> their iommu_fd security domain.

As above, while the group is the minimum "security context" for vfio,
the overall security context is much more broad. The group-level IOMMU
context is a simplification that allows us to combine isolation and
address-ability and so far it's not clear to me that the IOASID model
is also willing to take over these responsibilities. So again, if vfio
needs to manage these aspects that implies a lot of oversight of the
IOASID by vfio.

> It is a slightly different model than VFIO uses, but I don't think it
> provides less isolation.

It can be done correctly, but if IOASID isn't willing to take on
responsibility of managing isolation of the group, then it implies a
non-trivial degree of management by users like vfio to make sure
userspace access is and remains secure.

> > Otherwise, for a device centric VFIO/IOASID model, I need to understand
> > exactly when and how VFIO can know that it's safe to provide access to
> > a device and how the IOASID model guarantees the ongoing safety of that
> > access, which must encompass the safety relative to the entire group.
>
> Let's agree on what safety means, then we can evaluate it.

Largely rule #1

> > For example, is it VFIO's job to BIND every device in the group?
>
> I'm thinking no

Then who? Userspace? IOASID?

> > Does binding the device represent the point at which the IOASID
> > takes responsibility for the isolation of the device?
>
> Following Kevin's language BIND is when the device_fd and iommu_fd are
> connected. That is when I see the device as becoming usable. Whatever
> security/isolation requirements we decide should be met here

If device access is usable after a BIND, then that suggests the IOASID
must be managing the group. So why then do we have a device interface
for BIND rather than a group interface?

For example, given a group with devices A and B, the user performs a
BIND of deviceA_fd through vfio and now has access to device A. The
user then performs BIND of deviceB_fd through vfio and has access to
device B. BUT, something must have already taken on management of
device B in order to provide access to device A, so what's the point of
the BIND on device B? Why isn't it a group interface?

> > If instead it's the ATTACH of a device that provides the isolation,
> > how is VFIO supposed to
>
> Not the attach
>
> > DETACH occur through the IOASIDfd rather than the VFIOfd? It seems
> > like the IOASIDfd is going to need ways to manipulate device:IOASID
> > mappings outside of VFIO, so again I wonder if we should switch to an
> > IOASID uAPI at that point rather than using VFIO. Thanks,
>
> I don't think so... When the VFIO device_fd is closed it should
> disconnect the iommu from its device, restore the blocked DMA
> configuration, and then remove itself from the iommu_fd.

So continuing the above example, releasing deviceA_fd does what at the
group level? What if device A and B are DMA aliases of each other?
How does the group remain secure relative to userspace access via
device B?

> Once the device is back to blocked DMA there is no further need for
> the iommu_fd to touch it.

Blocked by whom? An IOMMU group assumes we cannot block DMA between
devices within the same group. In vfio, even an unused device that's
a member of an in-use group is placed into the IOMMU context of the
group, so a driver attaching to it that wants to do DMA will fail.

I'm really not seeing at all how this implicit group management is
supposed to work. By making it implicit it's clearly too easily
ignored, by the user dependencies and the implementation proposal.
Thanks,

Alex

2021-06-14 19:43:31

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, Jun 14, 2021 at 10:28:14AM -0600, Alex Williamson wrote:
> > To my mind it is these three things:
> >
> > 1. The device can only do DMA to memory put into its security context
>
> System memory or device memory, yes.
>
> Corollary: The IOMMU group defines the minimum set of devices where the
> IOMMU can control inter-device DMA.

Inter-device DMA is #2:

> > 2. No other security context can control this device
> > 3. No other security context can do DMA to my userspace memory
>
> Rule #1 is essentially the golden rule, the rest falls out from it.

But we can agree we can use these three principles to evaluate
any design? It is useful to split up the 'derived' ideas to make it
easier to understand.

> > Today in VFIO the security context is the group fd. I would like the
> > security context to be the iommu fd.
>
> The vfio group is simply a representation of the IOMMU group, which is
> the minimum isolation granularity. The group is therefore the minimum
> security context, but itself is not a security context. The overall
> security context for vfio is the set of containers (IOMMU contexts)
> owned by a user, where each container defines the IOMMU context for a
> set of groups. The user can map process and device memory between
> containers within the same security context.

A "security context" needs to be a concrete thing. It should be
something a process has some control over, in these models here all
security contexts are FDs.

If it isn't the group FD then all that is left is the container FD?

> As you know, we have various issues with invalidation of device
> mappings between containers, so simplifying the security context to the
> ioasidfd seems like a good plan. The vfio notion of a container is
> already encompassed in the IOASID of the ioasidfd.

Yes

> The significant difference is therefore the device level IOASID versus
> vfio's group level container granularity. This means the IOASID model
> needs to incorporate the group model not only in terms of isolation,
> but also address-ability.

Yes, we need a definition for what groups mean in this world. Groups
no longer mean a single IOASID for every device in the group.

> > 1 is achieved by ensuring the device is always connected to an
>
> s/device/group/

We have been talking about an iommu world where each device in a group
can have its own IOASIDs, so it no longer makes sense to talk about
groups as an assignable unit. Yes there is that degenerate case where
all devices in a group must have the same IOASID, but in general we
should be talking about devices, not groups, being assigned IOASIDs.

> As you note in reply to Kevin, in a multi-device group rule #1 can be
> violated if only one device is connected to an IOASID.

I assume that all devices under VFIO control are connected to safe
IOASIDs. A safe IOASID is one that blocks all DMA, or one that is
from the same iommu_fd. A device under VFIO control should not be
pointed at some kernel-owned IOASID that is DMA capable. It is a
change from today.
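
As an illustration of that definition, and only as a sketch: the enum, structs and `sketch_` functions below are hypothetical, chosen to mirror the three cases named above (blocks all DMA, belongs to the same iommu_fd, kernel-owned and DMA capable).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an open /dev/iommu file. */
struct sketch_iommu_fd {
    int id;
};

enum sketch_ioasid_kind {
    SKETCH_IOASID_BLOCK_DMA,  /* empty page table, no DMA possible */
    SKETCH_IOASID_USER,       /* created through an iommu_fd */
    SKETCH_IOASID_KERNEL_DMA, /* kernel-owned and DMA capable */
};

struct sketch_ioasid {
    enum sketch_ioasid_kind kind;
    const struct sketch_iommu_fd *owner; /* used for SKETCH_IOASID_USER */
};

/* Safe means: blocks all DMA, or belongs to the given iommu_fd. */
static bool sketch_ioasid_is_safe(const struct sketch_ioasid *ioasid,
                                  const struct sketch_iommu_fd *ifd)
{
    switch (ioasid->kind) {
    case SKETCH_IOASID_BLOCK_DMA:
        return true;
    case SKETCH_IOASID_USER:
        return ioasid->owner == ifd;
    case SKETCH_IOASID_KERNEL_DMA:
    default:
        return false;
    }
}

/* A group is safe for ifd when every member's current IOASID is safe. */
static bool sketch_group_is_safe(const struct sketch_ioasid *current,
                                 size_t count,
                                 const struct sketch_iommu_fd *ifd)
{
    for (size_t i = 0; i < count; i++) {
        if (!sketch_ioasid_is_safe(&current[i], ifd))
            return false;
    }
    return true;
}
```

Here `current` holds one entry per device in the group; a single device left on a kernel-owned, DMA-capable IOASID makes the whole group unsafe.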

Connecting two devices in the same group to IOASIDs in different
iommu_fd's would be blocked by the kernel, preventing a violation of #1.

> > IOASID. Today the group fd requires an IOASID before it hands out a
> > device_fd. With iommu_fd the device_fd will not allow IOCTLs until it
> > has a blocked DMA IOASID and is successfully joined to an iommu_fd.
>
> Which is the root of my concern. Who owns ioctls to the device fd?
> It's my understanding this is a vfio provided file descriptor and it's
> therefore vfio's responsibility.

Yes, VFIO

> A device-level IOASID interface therefore requires that vfio manage
> the group aspect of device access.

I envision it as some kernel call that vfio will do as part of the
bind ioctl:

iommu_fd_bind_device(vfio_dev->dev, iommu_fd, ...);

If everything is secure it succeeds and VFIO can allow this FD to
operate and process the rest of the ioctls. The new iommu_fd would
make the security decision. The security decision would look at groups
internally.

Three emails ago I outlined what I thought the logic of this function
should look like

> AFAICT, that means that device access can therefore only begin when
> all devices for a given group are attached to the IOASID and must
> halt for all devices in the group if any device is ever detached
> from an IOASID, even temporarily.

Which rule is broken if one device is attached and the other device is
left with no working device_fd?

No working device_fd means no mmap, no MMIO access, no DMA control of
the device. It is very similar to not doing the group_fd IOCTL to get
a device_fd in the first place.

Remember the IOASID for the unused devices will be pointing at
something safe.

> > 2 is achieved by ensuring that two security contexts can't open
> > devices in the same group. Today the group fd deals with this by being
> > single open. With iommu_fd the kernel would not permit splitting
> > groups between iommu_fds.
>
> "Who" within the kernel? Is it the IOASID code itself or is this
> another responsibility of vfio?

ioasid code. The iommu_fd_bind_device() would keep track of the single
iommu_fd that is allowed to use devices in this group.
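
A minimal sketch of that tracking, assuming a per-group owner pointer and a count of bound devices. This is a user-space model with hypothetical `sketch_` names; it is not the real iommu_fd_bind_device(), whose signature has not been settled, and it simplifies by rejecting a second bind of an already bound device.

```c
#include <stddef.h>

/* Hypothetical stand-in for an open /dev/iommu file. */
struct sketch_iommu_fd {
    int id;
};

struct sketch_group {
    const struct sketch_iommu_fd *owner; /* NULL: no member is bound */
    unsigned int bound_devices;
};

struct sketch_device {
    struct sketch_group *group;
    const struct sketch_iommu_fd *bound_to; /* NULL: not bound */
};

/* Returns 0 on success, -1 if the bind would split the group. */
static int sketch_bind_device(struct sketch_device *dev,
                              const struct sketch_iommu_fd *ifd)
{
    struct sketch_group *grp = dev->group;

    if (dev->bound_to != NULL)
        return -1; /* simplification: one bind per device */
    if (grp->owner != NULL && grp->owner != ifd)
        return -1; /* group already belongs to another iommu_fd */

    grp->owner = ifd;
    grp->bound_devices++;
    dev->bound_to = ifd;
    return 0;
}

/* The group is released once its last bound device goes away. */
static void sketch_unbind_device(struct sketch_device *dev)
{
    struct sketch_group *grp = dev->group;

    if (dev->bound_to == NULL)
        return;
    dev->bound_to = NULL;
    if (--grp->bound_devices == 0)
        grp->owner = NULL;
}
```

With devices A and B in one group, binding A through one iommu_fd makes a later bind of B through a different iommu_fd fail, while binding B through the same iommu_fd succeeds.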

> If IOASID knows about groups for this, it's not clear to me why we
> have a device-level bind interface. A group-level bind interface
> clearly makes this more explicit.

It does make it more explicit, but at the cost of introducing another
additional userspace object to manage - we still have to make the
whole API work on a per-device basis. Basically, adding the group
introduces a complexity, I would like us to all agree we need the
group and what exactly it is doing before we do that.

> > 3 is achieved today by the group_fd enforcing a single IOASID on all
> > devices. Under iommu_fd all devices in the group can use any IOASID in
> > their iommu_fd security domain.
>
> As above, while the group is the minimum "security context" for vfio,
> the overall security context is much more broad.

I don't understand this comment, can you describe what scenario would
be causing a problem?

> > It is a slightly different model than VFIO uses, but I don't think it
> > provides less isolation.
>
> It can be done correctly, but if IOASID isn't willing to take on
> responsibility of managing isolation of the group, then it implies a
> non-trivial degree of management by users like vfio to make sure
> userspace access is and remains secure.

I think it is important that the ioasid side do this - otherwise it
feels incomplete to me. VFIO handling it means that logic won't work
for other non-VFIO users, which feels wrong - even if those cases
probably have 1:1 device/group ratio.

> > > For example, is it VFIO's job to BIND every device in the group?
> >
> > I'm thinking no
>
> Then who? Userspace? IOASID?

Userspace would bind the device it wants

> > > Does binding the device represent the point at which the IOASID
> > > takes responsibility for the isolation of the device?
> >
> > Following Kevin's language BIND is when the device_fd and iommu_fd are
> > connected. That is when I see the device as becoming usable. Whatever
> > security/isolation requirements we decide should be met here
>
> If device access is usable after a BIND, then that suggests the IOASID
> must be managing the group. So why then do we have a device interface
> for BIND rather than a group interface?

This would be the only place the group would be used in the iommu_fd
API - and it is conveying redundant information - so do we need it?

> For example, given a group with devices A and B, the user performs a
> BIND of deviceA_fd through vfio and now has access to device A. The
> user then performs BIND of deviceB_fd through vfio and has access to
> device B.

Bind B would fail, iommu_fd_bind_device() will fail because a group
can only have devices in a single iommu_fd.

> So continuing the above example, releasing deviceA_fd does what at the
> group level? What if device A and B are DMA aliases of each other?
> How does the group remain secure relative to userspace access via
> device B?

It is very similar to today, if you close deviceA_fd the only way to
re-open it is via the existing group_fd. It remains parked while
closed.

With iommufd, if you close a deviceA_fd then it cannot be operated
until it is re-bound to the same iommu_fd that the other group members
are in. It remains parked, including with an IOASID that is either
block DMA, or an IOASID from the iommu_fd that is operating the other
devices - same as today.

> > Once the device is back to blocked DMA there is no further need for
> > the iommu_fd to touch it.
>
> Blocked by whom? An IOMMU group assumes we cannot block DMA between
> devices within the same group.

You asked if the iommu_fd needs to change things about the device -
the answer is no. Once the device's IOASID is set to 'block dma' there
are no further actions that can be done to it.

I'm still not seeing your objection concretely, sorry

Jason

2021-06-15 03:15:25

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, June 14, 2021 9:38 PM
>
> On Mon, Jun 14, 2021 at 03:09:31AM +0000, Tian, Kevin wrote:
>
> > If a device can be always blocked from accessing memory in the IOMMU
> > before it's bound to a driver or more specifically before the driver
> > moves it to a new security context, then there is no need for VFIO
> > to track whether IOASIDfd has taken over ownership of the DMA
> > context for all devices within a group.
>
> I've been assuming we'd do something like this, where when a device is
> first turned into a VFIO it tells the IOMMU layer that this device
> should be DMA blocked unless an IOASID is attached to
> it. Disconnecting an IOASID returns it to blocked.

Or just make sure a device is in block-DMA when it's unbound from a
driver or a security context. Then there is no need to explicitly tell the
IOMMU layer to do so when it's bound to a new driver.

Currently the default domain type applies even when a device is not
bound. This implies that if iommu=passthrough a device is always
allowed to access arbitrary system memory with or without a driver.
I feel the current domain type (identity, dma, unmanaged) should apply
only when a driver is loaded...
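
Roughly, as a sketch of the idea and nothing more: the enum and function below are hypothetical names, not the kernel's domain types. The assumption is that the configured default only takes effect once a driver is bound, and that an unbound device is always blocked.

```c
#include <stdbool.h>

/* Hypothetical domain types mirroring the ones discussed in this thread. */
enum sketch_domain_type {
    SKETCH_DOMAIN_BLOCKED,   /* no DMA allowed */
    SKETCH_DOMAIN_IDENTITY,  /* passthrough, no DMA protection */
    SKETCH_DOMAIN_DMA,       /* protected via DMA API */
    SKETCH_DOMAIN_UNMANAGED, /* driver-created domain */
};

/* The configured type applies only while a driver is bound. */
static enum sketch_domain_type
sketch_effective_domain(bool driver_bound, enum sketch_domain_type configured)
{
    return driver_bound ? configured : SKETCH_DOMAIN_BLOCKED;
}
```

With this rule, iommu=passthrough would no longer leave a driverless device able to reach arbitrary system memory.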

>
> > If this works I didn't see the need for vfio to keep the sequence.
> > VFIO still keeps group fd to claim ownership of all devices in a
> > group.
>
> As Alex says you still have to deal with the problem that device A in
> a group can gain control of device B in the same group.

If there is no isolation in the group, then how could vfio prevent device
A from gaining control of device B? For example, when both are attached
to the same GPA address space with the device MMIO BAR included, devA
can do p2p to devB. It's entirely the user's policy how to deal with
devices within the group.

>
> This means devices A and B cannot be used from two different
> security contexts.

It depends on how the security context is defined. From iommu layer
p.o.v, an IOASID is a security context which isolates a device from
the rest of the system (but not the sibling in the same group). As you
suggested earlier, it's completely sane if a user wants to attach
devices in a group to different IOASIDs. Here I just talk about this fact.

>
> If the /dev/iommu FD is the security context then the tracking is
> needed there.
>

As I replied to Alex, my point is that VFIO doesn't need to know the
attaching status of each device in a group before it can allow user to
access a device. As long as a device in a group is either in block DMA
or has switched to a new address space created via the /dev/iommu FD,
there's no problem with allowing the user to access it. The user cannot
do harm to the world outside of the group. The user knows there is no
isolation within the group. That is it.

Thanks
Kevin

2021-06-15 03:15:54

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Alex Williamson <[email protected]>
> Sent: Monday, June 14, 2021 11:23 AM
>
[...]
> > In the meantime, I'm thinking about another way whether group
> > security can be enforced in the iommu layer to relax the uAPI design.
> > If a device can be always blocked from accessing memory in the
> > IOMMU before it's bound to a driver or more specifically before
> > the driver moves it to a new security context, then there is no need
> > for VFIO to track whether IOASIDfd has taken over ownership of
> > the DMA context for all devices within a group.
>
> But we know we don't have IOMMU level isolation between devices in the
> same group, so I don't see how this helps us.
>
> > But as you said this cannot be achieved via existing default domain
> > approach. So far a device is always attached to a domain:
> >
> > - DOMAIN_IDENTITY: a default domain without DMA protection
> > - DOMAIN_DMA: a default domain with DMA protection via DMA
> > API and iommu core
> > - DOMAIN_UNMANAGED: a driver-created domain which is not
> > managed by iommu core.
> >
> > The special sequence in current vfio group design is to mitigate
> > the 1st case, i.e. if a device is left in passthrough mode before
> > bound to VFIO it's definitely insecure to allow user to access it.
> > Then the sequence ensures that the user access is granted on it
> > only after all devices within a group switch to a security context.
> >
> > Now if the new proposed scheme can be supported, a device
> > is always in a security context (block-dma) before it's switched
> > to a new security context and existing domain types should be
> > applied only in the new context when the device starts to do
> > DMAs. For VFIO case this switch happens explicitly when attaching
> > the device to an IOASID. For kernel driver it's implicit e.g. could
> > happen when the 1st DMA API call is received.
> >
> > If this works I didn't see the need for vfio to keep the sequence.
> > VFIO still keeps group fd to claim ownership of all devices in a
> > group. Once it's done, vfio doesn't need to track the device attach
> > status and user access can be always granted regardless of
> > how the attach status changes. Moving a device from IOASID1
> > to IOASID2 involves detaching from IOASID1 (back to blocked
> > dma context) and then reattaching to IOASID2 (switch to a
> > new security context).
> >
> > Following this direction even IOASIDfd doesn't need to verify
> > the group attach upon such guarantee from the iommu layer.
> > The devices within a group can be in different security contexts,
> > e.g. with some devices attached to GPA IOASID while others not
> > attached. In this way vfio userspace could choose to not attach
> > every device of a group to sustain the current semantics.
>
> It seems like this entirely misses the point of groups with multiple
> devices. If we had IOMMU level isolation between all devices, we'd
> never have multi-device groups. Thanks,
>

If multiple devices in a group are all in a block-DMA state when the
group is attached to vfio, why does vfio need to know whether they
have all been switched to a new security context via IOASIDfd before
it grants user access to a device in the group? Yes, there is no isolation
between devices within a group, but from the iommu p.o.v they are all
blocked from touching the rest of the system, thus letting the user access
them won't cause any security problem. Then it's just the user's call
how to tolerate the lack of isolation within that group:

1) User could attach all devices in the group to a single IOASID;
2) User could attach some devices in the group to a single IOASID,
leaving other devices still in block-DMA state;
3) User could attach some devices in the group to IOASID1 and others
to IOASID2, e.g. when the group is created due to !ACS and the two
address spaces are carefully tweaked to not cause undesired p2p
traffic;

At any point in the above use cases, the devices within a group are always
in a security context which isolates them from the rest of the system
(though with no isolation between them).
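
The three cases above can be sketched as a small state model. This is only an illustration under stated assumptions, not kernel code: the enum, the struct and model_group_isolated() are hypothetical names, and CTX_HOST_DEFAULT stands in for today's behavior where an unattached device sits in a host default domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical DMA contexts a device can be in (illustrative only). */
enum model_dma_ctx {
    CTX_BLOCK_DMA,     /* all DMA from the device is blocked */
    CTX_USER_IOASID,   /* attached to an IOASID created by the owning user */
    CTX_HOST_DEFAULT   /* host default domain, e.g. identity; not user-safe */
};

struct model_dev {
    enum model_dma_ctx ctx;
    int ioasid;        /* meaningful only for CTX_USER_IOASID */
};

/*
 * True if no device in the group can reach anything outside the group.
 * Which IOASID each attached device uses does not matter for this check;
 * that is left to user policy.
 */
bool model_group_isolated(const struct model_dev *devs, size_t n)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i].ctx == CTX_HOST_DEFAULT)
            return false;
    }
    return true;
}
```

In this model cases 1), 2) and 3) all pass the check, while a group with any device left in a host default domain does not.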

Thanks
Kevin

2021-06-15 03:16:27

by Tian, Kevin

[permalink] [raw]
Subject: RE: Plan for /dev/ioasid RFC v2

> From: Alex Williamson <[email protected]>
> Sent: Tuesday, June 15, 2021 12:28 AM
>
[...]
> > IOASID. Today the group fd requires an IOASID before it hands out a
> > device_fd. With iommu_fd the device_fd will not allow IOCTLs until it
> > has a blocked DMA IOASID and is successfully joined to an iommu_fd.
>
> Which is the root of my concern. Who owns ioctls to the device fd?
> It's my understanding this is a vfio provided file descriptor and it's
> therefore vfio's responsibility. A device-level IOASID interface
> therefore requires that vfio manage the group aspect of device access.
> AFAICT, that means that device access can therefore only begin when all
> devices for a given group are attached to the IOASID and must halt for
> all devices in the group if any device is ever detached from an IOASID,
> even temporarily. That suggests a lot more oversight of the IOASIDs by
> vfio than I'd prefer.
>

This is possibly the point that is worthy of more clarification and
alignment, as it sounds like the root of controversy here.

I feel the goal of vfio group management is more about ownership, i.e.
all devices within a group must be assigned to a single user. Following
the three rules defined by Jason, what we really care about is whether a
group of devices can be isolated from the rest of the world, i.e. no access
to memory/devices outside of its security context and no access to its
security context from devices outside of this group. This can be achieved
as long as every device in the group is either in block-DMA state (when
it's not attached to any security context) or attached to an IOASID
context in the IOMMU fd.

Once group-level isolation is satisfied, how devices within a group
are further managed is decided by the user (unattached, all attached to
the same IOASID, attached to different IOASIDs), as long as the user
understands the implications of the lack of isolation within the group.
This is where a device-centric model comes into play. Misconfiguration
just hurts the user itself.

If this rationale can be agreed on, then I don't see the point of having
VFIO mandate that all devices in the group be attached/detached in
lockstep.

Thanks
Kevin

2021-06-15 16:15:01

by Alex Williamson

[permalink] [raw]
Subject: Re: Plan for /dev/ioasid RFC v2

On Tue, 15 Jun 2021 02:31:39 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Tuesday, June 15, 2021 12:28 AM
> >
> [...]
> > > IOASID. Today the group fd requires an IOASID before it hands out a
> > > device_fd. With iommu_fd the device_fd will not allow IOCTLs until it
> > > has a blocked DMA IOASID and is successfully joined to an iommu_fd.
> >
> > Which is the root of my concern. Who owns ioctls to the device fd?
> > It's my understanding this is a vfio provided file descriptor and it's
> > therefore vfio's responsibility. A device-level IOASID interface
> > therefore requires that vfio manage the group aspect of device access.
> > AFAICT, that means that device access can therefore only begin when all
> > devices for a given group are attached to the IOASID and must halt for
> > all devices in the group if any device is ever detached from an IOASID,
> > even temporarily. That suggests a lot more oversight of the IOASIDs by
> > vfio than I'd prefer.
> >
>
> This is possibly the point that is worthy of more clarification and
> alignment, as it sounds like the root of controversy here.
>
> I feel the goal of vfio group management is more about ownership, i.e.
> all devices within a group must be assigned to a single user. Following
> the three rules defined by Jason, what we really care is whether a group
> of devices can be isolated from the rest of the world, i.e. no access to
> memory/device outside of its security context and no access to its
> security context from devices outside of this group. This can be achieved
> as long as every device in the group is either in block-DMA state when
> it's not attached to any security context or attached to an IOASID context
> in IOMMU fd.
>
> As long as group-level isolation is satisfied, how devices within a group
> are further managed is decided by the user (unattached, all attached to
> same IOASID, attached to different IOASIDs) as long as the user
> understands the implication of lacking of isolation within the group. This
> is what a device-centric model comes to play. Misconfiguration just hurts
> the user itself.
>
> If this rationale can be agreed, then I didn't see the point of having VFIO
> to mandate all devices in the group must be attached/detached in
> lockstep.

In theory this sounds great, but there are still too many assumptions
and too much hand waving about where isolation occurs for me to feel
like I really have the complete picture. So let's walk through some
examples. Please fill in and correct where I'm wrong.

1) A dual-function PCIe e1000e NIC where the functions are grouped
together due to ACS isolation issues.

a) Initial state: functions 0 & 1 are both bound to e1000e driver.

b) Admin uses driverctl to bind function 1 to vfio-pci, creating
vfio device file, which is chmod'd to grant to a user.

c) User opens vfio function 1 device file and an iommu_fd, binds
device_fd to iommu_fd.

Does this succeed?
- if no, specifically where does it fail?
- if yes, vfio can now allow access to the device?

d) Repeat b) for function 0.

e) Repeat c), still using function 1, is it different? Where? Why?

2) The same NIC as 1)

a) Initial state: functions 0 & 1 bound to vfio-pci, vfio device
files granted to user, user has bound both device_fds to the same
iommu_fd.

AIUI, even though not bound to an IOASID, vfio can now enable access
through the device_fds, right? What specific entity has placed these
devices into a block DMA state, when, and how?

b) Both devices are attached to the same IOASID.

Are we assuming that each device was atomically moved to the new
IOMMU context by the IOASID code? What if the IOMMU cannot change
the domain atomically?

c) The device_fd for function 1 is detached from the IOASID.

Are we assuming the reverse of b) performed by the IOASID code?

d) The device_fd for function 1 is unbound from the iommu_fd.

Does this succeed?
- if yes, what is the resulting IOMMU context of the device and
who owns it?
- if no, well, that results in numerous tear-down issues.

e) Function 1 is unbound from vfio-pci.

Does this work or is it blocked? If blocked, by what entity
specifically?

f) Function 1 is bound to e1000e driver.

We clearly have a violation here. Specifically where, and by whom, in
this path should we have been prevented from getting here, or who
triggers the BUG_ON to abort this?

3) A dual-function conventional PCI e1000 NIC where the functions are
grouped together due to shared RID.

a) Repeat 2.a) and 2.b) such that we have valid, user accessible
devices in the same IOMMU context.

b) Function 1 is detached from the IOASID.

I think function 1 cannot be placed into a different IOMMU context
here, does the detach work? What's the IOMMU context now?

c) A new IOASID is alloc'd within the existing iommu_fd and function
1 is attached to the new IOASID.

Where, how, by whom does this fail?

If vfio gets to offload all of its group management to IOASID code,
that's great, but I'm afraid that IOASID is so focused on a
device-level API that we're instead just ignoring the group dynamics
and vfio will be forced to provide oversight to maintain secure
userspace access. Thanks,

Alex

2021-06-15 17:00:05

by Alex Williamson

[permalink] [raw]
Subject: Re: Plan for /dev/ioasid RFC v2

On Tue, 15 Jun 2021 01:21:35 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Jason Gunthorpe <[email protected]>
> > Sent: Monday, June 14, 2021 9:38 PM
> >
> > On Mon, Jun 14, 2021 at 03:09:31AM +0000, Tian, Kevin wrote:
> >
> > > If a device can be always blocked from accessing memory in the IOMMU
> > > before it's bound to a driver or more specifically before the driver
> > > moves it to a new security context, then there is no need for VFIO
> > > to track whether IOASIDfd has taken over ownership of the DMA
> > > context for all devices within a group.
> >
> > I've been assuming we'd do something like this, where when a device is
> > first turned into a VFIO it tells the IOMMU layer that this device
> > should be DMA blocked unless an IOASID is attached to
> > it. Disconnecting an IOASID returns it to blocked.
>
> Or just make sure a device is in block-DMA when it's unbound from a
> driver or a security context. Then no need to explicitly tell IOMMU layer
> to do so when it's bound to a new driver.
>
> Currently the default domain type applies even when a device is not
> bound. This implies that if iommu=passthrough a device is always
> allowed to access arbitrary system memory with or without a driver.
> I feel the current domain type (identity, dma, unmanaged) should apply
> only when a driver is loaded...

Note that vfio does not currently require all devices in the group to
be bound to drivers. Other devices within the group, those bound to
vfio drivers, can be used in this configuration. This is not
necessarily recommended though as a non-vfio, non-stub driver binding
to one of those devices can trigger a BUG_ON.

> > > If this works I didn't see the need for vfio to keep the sequence.
> > > VFIO still keeps group fd to claim ownership of all devices in a
> > > group.
> >
> > As Alex says you still have to deal with the problem that device A in
> > a group can gain control of device B in the same group.
>
> There is no isolation in the group then how could vfio prevent device
> A from gaining control of device B? for example when both are attached
> to the same GPA address space with device MMIO bar included, devA
> can do p2p to devB. It's all user's policy how to deal with devices within
> the group.

The latter is user policy, yes, but it's a system security requirement
that the user cannot use device A to control device B if the user doesn't
have access to both devices, i.e. doesn't own the group. vfio would
prevent this by not allowing access to device A while device B is
insecure and would require that all devices within the group remain in
a secure, user-owned state for the extent of access to device A.

> > This means device A and B can not be used from to two different
> > security contexts.
>
> It depends on how the security context is defined. From iommu layer
> p.o.v, an IOASID is a security context which isolates a device from
> the rest of the system (but not the sibling in the same group). As you
> suggested earlier, it's completely sane if an user wants to attach
> devices in a group to different IOASIDs. Here I just talk about this fact.

This is sane, yes, but that doesn't give us license to allow the user
to access device A regardless of the state of device B.

> >
> > If the /dev/iommu FD is the security context then the tracking is
> > needed there.
> >
>
> As I replied to Alex, my point is that VFIO doesn't need to know the
> attaching status of each device in a group before it can allow user to
> access a device. As long as a device in a group either in block DMA
> or switch to a new address space created via /dev/iommu FD, there's
> no problem to allow user accessing it. User cannot do harm to the
> world outside of the group. User knows there is no isolation within
> the group. that is it.

This is self-contradictory: "vfio doesn't need to know the attachment
status"... "[a]s long as a device in a group either in block DMA or
switch to a new address space". So vfio does need to know the latter.
How does it know that? Thanks,

Alex

2021-06-16 06:44:35

by Tian, Kevin

[permalink] [raw]
Subject: RE: Plan for /dev/ioasid RFC v2

> From: Alex Williamson <[email protected]>
> Sent: Wednesday, June 16, 2021 12:12 AM
>
> On Tue, 15 Jun 2021 02:31:39 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Tuesday, June 15, 2021 12:28 AM
> > >
> > [...]
> > > > IOASID. Today the group fd requires an IOASID before it hands out a
> > > > device_fd. With iommu_fd the device_fd will not allow IOCTLs until it
> > > > has a blocked DMA IOASID and is successfully joined to an iommu_fd.
> > >
> > > Which is the root of my concern. Who owns ioctls to the device fd?
> > > It's my understanding this is a vfio provided file descriptor and it's
> > > therefore vfio's responsibility. A device-level IOASID interface
> > > therefore requires that vfio manage the group aspect of device access.
> > > AFAICT, that means that device access can therefore only begin when all
> > > devices for a given group are attached to the IOASID and must halt for
> > > all devices in the group if any device is ever detached from an IOASID,
> > > even temporarily. That suggests a lot more oversight of the IOASIDs by
> > > vfio than I'd prefer.
> > >
> >
> > This is possibly the point that is worthy of more clarification and
> > alignment, as it sounds like the root of controversy here.
> >
> > I feel the goal of vfio group management is more about ownership, i.e.
> > all devices within a group must be assigned to a single user. Following
> > the three rules defined by Jason, what we really care is whether a group
> > of devices can be isolated from the rest of the world, i.e. no access to
> > memory/device outside of its security context and no access to its
> > security context from devices outside of this group. This can be achieved
> > as long as every device in the group is either in block-DMA state when
> > it's not attached to any security context or attached to an IOASID context
> > in IOMMU fd.
> >
> > As long as group-level isolation is satisfied, how devices within a group
> > are further managed is decided by the user (unattached, all attached to
> > same IOASID, attached to different IOASIDs) as long as the user
> > understands the implication of lacking of isolation within the group. This
> > is what a device-centric model comes to play. Misconfiguration just hurts
> > the user itself.
> >
> > If this rationale can be agreed, then I didn't see the point of having VFIO
> > to mandate all devices in the group must be attached/detached in
> > lockstep.
>
> In theory this sounds great, but there are still too many assumptions
> and too much hand waving about where isolation occurs for me to feel
> like I really have the complete picture. So let's walk through some
> examples. Please fill in and correct where I'm wrong.

Thanks for putting together these examples. They are helpful for
clarifying the whole picture.

Before filling them in, let's first align on the key difference between
the current VFIO model and this new proposal. With this comparison we'll
know which of the following questions are answered by the existing VFIO
mechanism and which are handled differently.

With Yi's help we figured out the current mechanism:

1) vfio_group_viable. The code comment explains the intention clearly:

--
* A vfio group is viable for use by userspace if all devices are in
* one of the following states:
* - driver-less
* - bound to a vfio driver
* - bound to an otherwise allowed driver
* - a PCI interconnect device
--

Note this check is not related to an IOMMU security context.
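
As a rough illustration of the rule in that comment, the check can be modeled as below. This is a sketch with hypothetical names (enum model_bind_state, model_group_viable), not the kernel's implementation; MODEL_BIND_OTHER is an assumed catch-all for any driver outside the four allowed cases, such as e1000e in the examples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device binding states (illustrative only). */
enum model_bind_state {
    MODEL_BIND_NONE,          /* driver-less */
    MODEL_BIND_VFIO,          /* bound to a vfio driver */
    MODEL_BIND_ALLOWED,       /* bound to an otherwise allowed driver */
    MODEL_BIND_INTERCONNECT,  /* a PCI interconnect device */
    MODEL_BIND_OTHER          /* any other host driver, e.g. e1000e */
};

/*
 * A group is viable only if every device is in one of the four allowed
 * states. Note that no IOMMU state is consulted here.
 */
bool model_group_viable(const enum model_bind_state *devs, size_t n)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (devs[i] == MODEL_BIND_OTHER)
            return false;
    }
    return true;
}
```

With function 0 still on e1000e and function 1 on vfio-pci the modeled group is not viable; once both are bound to vfio-pci it is.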

2) vfio_iommu_group_notifier. When an IOMMU_GROUP_NOTIFY_BOUND_DRIVER
event is received, vfio_group_viable is re-evaluated. If the affected
group was previously viable but is no longer viable, BUG_ON() fires, as
this implies that a device in the group has been bound to a non-vfio
driver, which breaks the group isolation.

3) vfio_group_get_device_fd. The user can acquire a device fd only after:
a) the group is viable;
b) the group is attached to a container;
c) iommu is set on the container (implying that a security context is
established).

The new device-centric proposal suggests:

1) vfio_group_viable;
2) vfio_iommu_group_notifier;
3) block-DMA if a device is detached from its previous domain (instead of
switching back to the default domain as today);
4) vfio_group_get_device_fd. The user can acquire a device fd once the
group is viable;
5) device-centric operation when binding to the IOMMU fd or attaching to
an IOASID.

In this model the group viability mechanism is kept, but there is no need
for VFIO to track the actual attach status.

Now let's look at how the new model works.

>
> 1) A dual-function PCIe e1000e NIC where the functions are grouped
> together due to ACS isolation issues.
>
> a) Initial state: functions 0 & 1 are both bound to e1000e driver.
>
> b) Admin uses driverctl to bind function 1 to vfio-pci, creating
> vfio device file, which is chmod'd to grant to a user.

This implies that function 1 is in block-DMA mode when it's unbound
from e1000e.

>
> c) User opens vfio function 1 device file and an iommu_fd, binds
> device_fd to iommu_fd.

User should check group viability before step c).

>
> Does this succeed?
> - if no, specifically where does it fail?
> - if yes, vfio can now allow access to the device?
>

With the group viability check, step c) fails, since function 0 is still
bound to e1000e.

> d) Repeat b) for function 0.

Function 0 is in block-DMA mode now.

>
> e) Repeat c), still using function 1, is it different? Where? Why?

It's different because the group is viable now, so step c) succeeds.
At this point both functions 0 and 1 are in block-DMA mode and thus
isolated from the rest of the system. VFIO allows the user to access
function 1 without needing to know when function 1 is attached to a new
context (IOASID) via the IOMMU fd or whether function 0 is left detached.

>
> 2) The same NIC as 1)
>
> a) Initial state: functions 0 & 1 bound to vfio-pci, vfio device
> files granted to user, user has bound both device_fds to the same
> iommu_fd.
>
> AIUI, even though not bound to an IOASID, vfio can now enable access
> through the device_fds, right? What specific entity has placed these

yes

> devices into a block DMA state, when, and how?

As explained in 1.b), both devices are put into block-DMA when they
are detached from the default domain, which is used while they are
bound to the e1000e driver.

>
> b) Both devices are attached to the same IOASID.
>
> Are we assuming that each device was atomically moved to the new
> IOMMU context by the IOASID code? What if the IOMMU cannot change
> the domain atomically?

No. Moving function 0 and then function 1, or moving function 0 alone,
both work. A function which hasn't been attached to an IOASID is kept in
block-DMA state.

>
> c) The device_fd for function 1 is detached from the IOASID.
>
> Are we assuming the reverse of b) performed by the IOASID code?

Function 1 returns to block-DMA.

>
> d) The device_fd for function 1 is unbound from the iommu_fd.
>
> Does this succeed?
> - if yes, what is the resulting IOMMU context of the device and
> who owns it?
> - if no, well, that results in numerous tear-down issues.

Yes. Function 1 is in block-DMA while function 0 is still attached to the
IOASID. Actually, unbinding from the IOMMU fd doesn't change the security
context. The change happens when attaching/detaching a device to/from
an IOASID.

>
> e) Function 1 is unbound from vfio-pci.
>
> Does this work or is it blocked? If blocked, by what entity
> specifically?

It works.

>
> f) Function 1 is bound to e1000e driver.
>
> We clearly have a violation here, specifically where and by who in
> this path should have prevented us from getting here or who pushes
> the BUG_ON to abort this?

Via vfio_iommu_group_notifier, same as today.

>
> 3) A dual-function conventional PCI e1000 NIC where the functions are
> grouped together due to shared RID.
>
> a) Repeat 2.a) and 2.b) such that we have a valid, user accessible
> devices in the same IOMMU context.
>
> b) Function 1 is detached from the IOASID.
>
> I think function 1 cannot be placed into a different IOMMU context
> here, does the detach work? What's the IOMMU context now?

Yes. Function 1 is back to block-DMA. Since both functions share a RID,
this essentially implies that function 0 is in block-DMA state too (though
its tracking state may not have changed yet), since the shared IOMMU
context entry blocks DMA now. In the IOMMU fd function 0 is still attached
to the IOASID, thus the user still needs to do an explicit detach to clear
the tracking state for function 0.
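
A small model may make the split between the shared context entry and the per-function tracking state concrete. Everything here is hypothetical (the structs, MODEL_IOASID_NONE and the model_* helpers are not an existing API); it only assumes that functions sharing a RID also share one IOMMU context entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_IOASID_NONE (-1)  /* hypothetical marker for block-DMA */

/* One IOMMU context entry per RID (illustrative only). */
struct model_ctx_entry {
    int ioasid;                  /* MODEL_IOASID_NONE means DMA is blocked */
};

/* Per-function bookkeeping as the IOMMU fd might keep it. */
struct model_func {
    struct model_ctx_entry *ctx; /* shared by functions with the same RID */
    int tracked_ioasid;          /* what the bookkeeping believes */
};

void model_attach(struct model_func *f, int ioasid)
{
    assert(f != NULL && f->ctx != NULL);
    f->tracked_ioasid = ioasid;
    f->ctx->ioasid = ioasid;     /* rewrites the entry for every sharer */
}

void model_detach(struct model_func *f)
{
    assert(f != NULL && f->ctx != NULL);
    f->tracked_ioasid = MODEL_IOASID_NONE;
    f->ctx->ioasid = MODEL_IOASID_NONE;
}

/* What the hardware would actually do with this function's DMA. */
bool model_dma_blocked(const struct model_func *f)
{
    assert(f != NULL && f->ctx != NULL);
    return f->ctx->ioasid == MODEL_IOASID_NONE;
}
```

After both functions are attached to one IOASID and function 1 is detached, the model reports DMA as blocked for both functions while function 0 is still tracked as attached, which is the mismatch described above.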

>
> c) A new IOASID is alloc'd within the existing iommu_fd and function
> 1 is attached to the new IOASID.
>
> Where, how, by whom does this fail?

There is no need for it to fail. It can succeed, since doing so just
shoots the user's own foot.

The only question is how the user learns that a group of devices share
a RID and thus avoids doing this. I'm curious how this is communicated
with today's VFIO mechanism. Yes, the group-centric VFIO uAPI prevents
a group of devices from attaching to multiple IOMMU contexts, but I
suppose we still need a way to tell the user not to do so. In particular,
such knowledge would also need to be reflected in the virtual PCI topology
when the entire group is assigned to a guest, which needs to know this
fact when a vIOMMU is exposed. I haven't found time to investigate it,
but I suppose that if such a channel exists it could be reused, or in the
worst case the new device capability interface could convey it...

>
> If vfio gets to offload all of its group management to IOASID code,
> that's great, but I'm afraid that IOASID is so focused on a
> device-level API that we're instead just ignoring the group dynamics
> and vfio will be forced to provide oversight to maintain secure
> userspace access. Thanks,
>

In summary, the security of the group dynamics is handled through
block-DMA plus the existing vfio_group_viable mechanism in this
device-centric design. VFIO still keeps its group management, but has no
need to track the attach status in order to allow user access.

Thanks
Kevin

2021-06-16 06:54:19

by Tian, Kevin

[permalink] [raw]
Subject: RE: Plan for /dev/ioasid RFC v2

> From: Alex Williamson <[email protected]>
> Sent: Wednesday, June 16, 2021 12:56 AM
>
> On Tue, 15 Jun 2021 01:21:35 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Monday, June 14, 2021 9:38 PM
> > >
> > > On Mon, Jun 14, 2021 at 03:09:31AM +0000, Tian, Kevin wrote:
> > >
> > > > If a device can be always blocked from accessing memory in the IOMMU
> > > > before it's bound to a driver or more specifically before the driver
> > > > moves it to a new security context, then there is no need for VFIO
> > > > to track whether IOASIDfd has taken over ownership of the DMA
> > > > context for all devices within a group.
> > >
> > > I've been assuming we'd do something like this, where when a device is
> > > first turned into a VFIO it tells the IOMMU layer that this device
> > > should be DMA blocked unless an IOASID is attached to
> > > it. Disconnecting an IOASID returns it to blocked.
> >
> > Or just make sure a device is in block-DMA when it's unbound from a
> > driver or a security context. Then no need to explicitly tell IOMMU layer
> > to do so when it's bound to a new driver.
> >
> > Currently the default domain type applies even when a device is not
> > bound. This implies that if iommu=passthrough a device is always
> > allowed to access arbitrary system memory with or without a driver.
> > I feel the current domain type (identity, dma, unmanaged) should apply
> > only when a driver is loaded...
>
> Note that vfio does not currently require all devices in the group to
> be bound to drivers. Other devices within the group, those bound to
> vfio drivers, can be used in this configuration. This is not
> necessarily recommended though as a non-vfio, non-stub driver binding
> to one of those devices can trigger a BUG_ON.

This is a good lesson which I didn't realize before.

As explained in the previous mail, we need to reuse the group_viable
mechanism to trigger BUG_ON.

>
> > > > If this works I didn't see the need for vfio to keep the sequence.
> > > > VFIO still keeps group fd to claim ownership of all devices in a
> > > > group.
> > >
> > > As Alex says you still have to deal with the problem that device A in
> > > a group can gain control of device B in the same group.
> >
> > There is no isolation in the group then how could vfio prevent device
> > A from gaining control of device B? for example when both are attached
> > to the same GPA address space with device MMIO bar included, devA
> > can do p2p to devB. It's all user's policy how to deal with devices within
> > the group.
>
> The latter is user policy, yes, but it's a system security issue that
> the user cannot use device A to control device B if the user doesn't
> have access to both devices, ie. doesn't own the group. vfio would
> prevent this by not allowing access to device A while device B is
> insecure and would require that all devices within the group remain in
> a secure, user owned state for the extent of access to device A.
>
> > > This means device A and B can not be used from to two different
> > > security contexts.
> >
> > It depends on how the security context is defined. From iommu layer
> > p.o.v, an IOASID is a security context which isolates a device from
> > the rest of the system (but not the sibling in the same group). As you
> > suggested earlier, it's completely sane if an user wants to attach
> > devices in a group to different IOASIDs. Here I just talk about this fact.
>
> This is sane, yes, but that doesn't give us license to allow the user
> to access device A regardless of the state of device B.
>
> > >
> > > If the /dev/iommu FD is the security context then the tracking is
> > > needed there.
> > >
> >
> > As I replied to Alex, my point is that VFIO doesn't need to know the
> > attaching status of each device in a group before it can allow user to
> > access a device. As long as a device in a group either in block DMA
> > or switch to a new address space created via /dev/iommu FD, there's
> > no problem to allow user accessing it. User cannot do harm to the
> > world outside of the group. User knows there is no isolation within
> > the group. that is it.
>
> This is self contradictory, "vfio doesn't need to know the attachment
> status"... "[a]s long as a device in a group either in block DMA or
> switch to a new address space". So vfio does need to know the latter.
> How does it know that? Thanks,
>

My point was that if a device can only be in one of two states, block-DMA
or attached to a new address space, both of which are secure, then vfio
doesn't need to track which state a device is actually in.

Thanks
Kevin

2021-06-17 01:55:59

by Alex Williamson

[permalink] [raw]
Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, 16 Jun 2021 06:43:23 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Wednesday, June 16, 2021 12:12 AM
> >
> > On Tue, 15 Jun 2021 02:31:39 +0000
> > "Tian, Kevin" <[email protected]> wrote:
> >
> > > > From: Alex Williamson <[email protected]>
> > > > Sent: Tuesday, June 15, 2021 12:28 AM
> > > >
> > > [...]
> > > > > IOASID. Today the group fd requires an IOASID before it hands out a
> > > > > device_fd. With iommu_fd the device_fd will not allow IOCTLs until it
> > > > > has a blocked DMA IOASID and is successfully joined to an iommu_fd.
> > > >
> > > > Which is the root of my concern. Who owns ioctls to the device fd?
> > > > It's my understanding this is a vfio provided file descriptor and it's
> > > > therefore vfio's responsibility. A device-level IOASID interface
> > > > therefore requires that vfio manage the group aspect of device access.
> > > > AFAICT, that means that device access can therefore only begin when all
> > > > devices for a given group are attached to the IOASID and must halt for
> > > > all devices in the group if any device is ever detached from an IOASID,
> > > > even temporarily. That suggests a lot more oversight of the IOASIDs by
> > > > vfio than I'd prefer.
> > > >
> > >
> > > This is possibly the point that is worthy of more clarification and
> > > alignment, as it sounds like the root of controversy here.
> > >
> > > I feel the goal of vfio group management is more about ownership, i.e.
> > > all devices within a group must be assigned to a single user. Following
> > > the three rules defined by Jason, what we really care is whether a group
> > > of devices can be isolated from the rest of the world, i.e. no access to
> > > memory/device outside of its security context and no access to its
> > > security context from devices outside of this group. This can be achieved
> > > as long as every device in the group is either in block-DMA state when
> > > it's not attached to any security context or attached to an IOASID context
> > > in IOMMU fd.
> > >
> > > As long as group-level isolation is satisfied, how devices within a group
> > > are further managed is decided by the user (unattached, all attached to
> > > same IOASID, attached to different IOASIDs) as long as the user
> > > understands the implication of lacking isolation within the group. This
> > > is where a device-centric model comes into play. Misconfiguration just hurts
> > > the user itself.
> > >
> > > If this rationale can be agreed, then I didn't see the point of having VFIO
> > > to mandate all devices in the group must be attached/detached in
> > > lockstep.
> >
> > In theory this sounds great, but there are still too many assumptions
> > and too much hand waving about where isolation occurs for me to feel
> > like I really have the complete picture. So let's walk through some
> > examples. Please fill in and correct where I'm wrong.
>
> Thanks for putting these examples. They are helpful for clearing the
> whole picture.
>
> Before filling in let's first align on what is the key difference between
> current VFIO model and this new proposal. With this comparison we'll
> know which of following questions are answered with existing VFIO
> mechanism and which are handled differently.
>
> With Yi's help we figured out the current mechanism:
>
> 1) vfio_group_viable. The code comment explains the intention clearly:
>
> --
> * A vfio group is viable for use by userspace if all devices are in
> * one of the following states:
> * - driver-less
> * - bound to a vfio driver
> * - bound to an otherwise allowed driver
> * - a PCI interconnect device
> --
>
> Note this check is not related to an IOMMU security context.

Because this is a pre-requisite for imposing that IOMMU security
context.
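
The viability rule quoted above can be sketched as a small predicate. This
is only a toy model under my own assumptions, not the kernel's
vfio_group_viable(): the enum, its member names and group_viable() are
hypothetical. It just shows that the check looks at driver binding and
says nothing about the IOMMU context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of how a group member is bound. */
enum binding {
    BIND_NONE,             /* driver-less */
    BIND_VFIO,             /* bound to a vfio driver */
    BIND_ALLOWED,          /* bound to an otherwise allowed driver */
    BIND_PCI_INTERCONNECT, /* a PCI interconnect device */
    BIND_HOST_DRIVER,      /* any other host driver, e.g. e1000e */
};

/* A group is viable only if no member is bound to an ordinary host driver. */
bool group_viable(const enum binding *members, size_t n)
{
    assert(members != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (members[i] == BIND_HOST_DRIVER)
            return false;
    }
    return true;
}
```

With function 0 on e1000e and function 1 on vfio-pci the predicate is
false; once both are on vfio-pci it becomes true, yet nothing in it
describes where DMA from either function can go.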

> 2) vfio_iommu_group_notifier. When an IOMMU_GROUP_NOTIFY_
> BOUND_DRIVER event is notified, vfio_group_viable is re-evaluated.
> If the affected group was previously viable but now becomes not
> viable, BUG_ON() as it implies that this device is bound to a non-vfio
> driver which breaks the group isolation.

This notifier action is conditional on there being users of devices
within a secure group IOMMU context.

> 3) vfio_group_get_device_fd. User can acquire a device fd only after
> a) the group is viable;
> b) the group is attached to a container;
> c) iommu is set on the container (implying a security context
> established);

The order is actually b) a) c) but arguably b) is a no-op until:

d) a device fd is provided to the user

> The new device-centric proposal suggests:
>
> 1) vfio_group_viable;
> 2) vfio_iommu_group_notifier;
> 3) block-DMA if a device is detached from previous domain (instead of
> switching back to default domain as today);

I'm literally begging for specifics in this thread, but none are
provided here. What is the "previous domain"? How is a device placed
into a DMA blocking IOMMU context? Is this the IOMMU default domain?
Doesn't that represent a change in IOMMU behavior to place devices into
a blocking DMA context in several of the group-viable scenarios?

> 4) vfio_group_get_device_fd. User can acquire a device fd once the group
> is viable;

But as you've noted, "viable" doesn't test the IOMMU context of the
group devices, it's only a pre-condition for attaching the group to an
IOMMU context for isolated access. What changes in the kernel that
makes "viable" become "isolated"? A device bound to pci-stub today is
certainly not in a DMA blocking context when the host is booted with
iommu=pt. Enabling the IOMMU only for device assignment by using
iommu=pt is arguably the predominant use case of the IOMMU.

> 5) device-centric when binding to IOMMU fd or attaching to IOASID
>
> In this model the group viability mechanism is kept but there is no need
> for VFIO to track the actual attaching status.
>
> Now let's look at how the new model works.
>
> >
> > 1) A dual-function PCIe e1000e NIC where the functions are grouped
> > together due to ACS isolation issues.
> >
> > a) Initial state: functions 0 & 1 are both bound to e1000e driver.
> >
> > b) Admin uses driverctl to bind function 1 to vfio-pci, creating
> > vfio device file, which is chmod'd to grant to a user.
>
> This implies that function 1 is in block-DMA mode when it's unbound
> from e1000e.

Does this require a kernel change from current? Does it require the
host is not in iommu=pt mode? Did vfio or vfio-pci do anything to
impose this DMA blocking context? What if function 1 is actually a DMA
alias of function 0, wouldn't changing function 1's IOMMU context break
the operation of function 0?

> >
> > c) User opens vfio function 1 device file and an iommu_fd, binds
> > device_fd to iommu_fd.
>
> User should check group viability before step c).

Sure, but "user should" is not a viable security model.

> >
> > Does this succeed?
> > - if no, specifically where does it fail?
> > - if yes, vfio can now allow access to the device?
> >
>
> with group viability step c) fails.

I'm asking for specifics, is it vfio's responsibility to test viability
before trying to bind the device_fd to the iommu_fd and it's vfio that
triggers this failure? This sounds like vfio is entirely responsible
for managing the integrity of the group.

> > d) Repeat b) for function 0.
>
> function 0 is in block DMA mode now.

Somehow...

> >
> > e) Repeat c), still using function 1, is it different? Where? Why?
>
> it's different because group becomes viable now. Then step c) succeeds.
> At this point, both function 0/1 are in block-DMA mode thus isolated
> from the rest of the system. VFIO allows the user to access function 1
> without the need of knowing when function 1 is attached to a new
> context (IOASID) via IOMMU fd and whether function 0 is left detached.
>
> >
> > 2) The same NIC as 1)
> >
> > a) Initial state: functions 0 & 1 bound to vfio-pci, vfio device
> > files granted to user, user has bound both device_fds to the same
> > iommu_fd.
> >
> > AIUI, even though not bound to an IOASID, vfio can now enable access
> > through the device_fds, right? What specific entity has placed these
>
> yes
>
> > devices into a block DMA state, when, and how?
>
> As explained in 1.b), both devices are put into block-DMA when they
> are detached from the default domain which is used when they are
> bound to e1000e driver.

How do stub drivers interact with this model? How do PCI interconnect
drivers work with this model? How do DMA alias devices work with this
model? How does iommu=pt work with this model? Does vfio just
passively assume the DMA blocking IOMMU context based on other random
attributes of the device?

> >
> > b) Both devices are attached to the same IOASID.
> >
> > Are we assuming that each device was atomically moved to the new
> > IOMMU context by the IOASID code? What if the IOMMU cannot change
> > the domain atomically?
>
> No. Moving function 0 then function 1, or moving function 0 alone can
> all work. The one which hasn't been attached to an IOASID is kept in
> block-DMA state.

I'm asking whether this can be accomplished atomically relative to
device DMA. If the user has access to the device after the bind
operation and the device operates in a DMA blocking IOMMU context at
that point, it seems that every IOASID context switch must be atomic
relative to device DMA or we present an exploitable gap to the user.

This is another change from vfio, the lifetime of the IOMMU context
encompasses the lifetime of device access.
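
The atomicity concern can be illustrated by checking every context a
device passes through during a switch. This is a toy model with
hypothetical names, and it assumes the only unsafe context is the host's
own domain (for example an identity mapping under iommu=pt).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical IOMMU contexts a user-accessible device might be in. */
enum ctx {
    CTX_BLOCKED,      /* all DMA rejected */
    CTX_HOST_DOMAIN,  /* host-owned domain, unsafe for a user device */
    CTX_IOASID_A,
    CTX_IOASID_B,
};

bool ctx_is_safe(enum ctx c)
{
    return c != CTX_HOST_DOMAIN;
}

/* A switch is safe only if every context it passes through is safe. */
bool transition_is_safe(const enum ctx *steps, size_t n)
{
    assert(steps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!ctx_is_safe(steps[i]))
            return false;
    }
    return true;
}
```

A switch that goes A, blocked, B passes; one that goes A, host domain, B
does not, even if the intermediate step is brief.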

> >
> > c) The device_fd for function 1 is detached from the IOASID.
> >
> > Are we assuming the reverse of b) performed by the IOASID code?
>
> function 1 turns back to block-DMA
>
> >
> > d) The device_fd for function 1 is unbound from the iommu_fd.
> >
> > Does this succeed?
> > - if yes, what is the resulting IOMMU context of the device and
> > who owns it?
> > - if no, well, that results in numerous tear-down issues.
>
> Yes. function 1 is block-DMA while function 0 still attached to IOASID.
> Actually unbind from IOMMU fd doesn't change the security context.
> The change is conducted when attaching/detaching device to/from an
> IOASID.

But I think you're suggesting that the IOMMU context is simply the
device's default domain, so vfio is left in the position where the user
gained access to the device by binding it to an iommu_fd, but now the
device exists outside of the iommu_fd. Doesn't that make it pointless
to gate device access on binding the device to the iommu_fd? The user
can get an accessible device_fd unbound from an iommu_fd on the reverse
path.

That would mean vfio's only control point for device access is on
open().

> >
> > e) Function 1 is unbound from vfio-pci.
> >
> > Does this work or is it blocked? If blocked, by what entity
> > specifically?
>
> works.
>
> >
> > f) Function 1 is bound to e1000e driver.
> >
> > We clearly have a violation here, specifically where and by who in
> > this path should have prevented us from getting here or who pushes
> > the BUG_ON to abort this?
>
> via vfio_iommu_group_notifier, same as today.

So as above, group integrity remains entirely vfio's issue? Didn't we
discuss elsewhere in this thread that unless group integrity is managed
by /dev/iommu that we're going to have a mess of different consumers
managing it with different degrees of effectiveness (or more likely just
ignoring it)?

> >
> > 3) A dual-function conventional PCI e1000 NIC where the functions are
> > grouped together due to shared RID.
> >
> > a) Repeat 2.a) and 2.b) such that we have valid, user accessible
> > devices in the same IOMMU context.
> >
> > b) Function 1 is detached from the IOASID.
> >
> > I think function 1 cannot be placed into a different IOMMU context
> > here, does the detach work? What's the IOMMU context now?
>
> Yes. Function 1 is back to block-DMA. Since both functions share RID,
> essentially it implies function 0 is in block-DMA state too (though its
> tracking state may not change yet) since the shared IOMMU context
> entry blocks DMA now. In IOMMU fd function 0 is still attached to the
> IOASID thus the user still needs to do an explicit detach to clear the
> tracking state for function 0.
>
> >
> > c) A new IOASID is alloc'd within the existing iommu_fd and function
> > 1 is attached to the new IOASID.
> >
> > Where, how, by whom does this fail?
>
> No need to fail. It can succeed since doing so just hurts user's own foot.
>
> The only question is how user knows the fact that a group of devices
> share RID thus avoid such thing. I'm curious how it is communicated
> with today's VFIO mechanism. Yes the group-centric VFIO uAPI prevents
> a group of devices from attaching to multiple IOMMU contexts, but
> suppose we still need a way to tell the user to not do so. Especially
> such knowledge would be also reflected in the virtual PCI topology
> when the entire group is assigned to the guest which needs to know
> this fact when vIOMMU is exposed. I haven't found time to investigate
> it but suppose if such channel exists it could be reused, or in the worst
> case we may have the new device capability interface to convey...

No such channel currently exists, it's not an issue today, IOMMU
context is group-based.
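
The shared-RID case in 3) can be sketched as follows. This is a toy model
with hypothetical names, assuming one hardware context entry per RID and
separate software tracking per function. It shows how the tracked state
and the effective context diverge once functions sharing a RID are handled
individually.

```c
#include <assert.h>
#include <stddef.h>

#define CTX_BLOCKED (-1)  /* illustrative marker for a DMA-blocking entry */
#define TOY_MAX_RID 4

/* One hardware context entry per requester ID (RID). */
struct toy_iommu {
    int ctx_by_rid[TOY_MAX_RID];
};

/* Per-function software tracking of the attached IOASID. */
struct toy_func {
    int rid;
    int tracked_ioasid;
};

void toy_attach(struct toy_iommu *mmu, struct toy_func *f, int ioasid)
{
    assert(mmu != NULL && f != NULL);
    assert(f->rid >= 0 && f->rid < TOY_MAX_RID);
    mmu->ctx_by_rid[f->rid] = ioasid;  /* affects every function on this RID */
    f->tracked_ioasid = ioasid;
}

void toy_detach(struct toy_iommu *mmu, struct toy_func *f)
{
    assert(mmu != NULL && f != NULL);
    assert(f->rid >= 0 && f->rid < TOY_MAX_RID);
    mmu->ctx_by_rid[f->rid] = CTX_BLOCKED;
    f->tracked_ioasid = CTX_BLOCKED;
}

/* The context that DMA from this function actually hits. */
int toy_effective_ctx(const struct toy_iommu *mmu, const struct toy_func *f)
{
    assert(mmu != NULL && f != NULL);
    assert(f->rid >= 0 && f->rid < TOY_MAX_RID);
    return mmu->ctx_by_rid[f->rid];
}
```

After detaching function 1, function 0 is still tracked as attached but
its DMA is blocked; attaching function 1 to a new IOASID then silently
moves function 0 as well.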

> > If vfio gets to offload all of its group management to IOASID code,
> > that's great, but I'm afraid that IOASID is so focused on a
> > device-level API that we're instead just ignoring the group dynamics
> > and vfio will be forced to provide oversight to maintain secure
> > userspace access. Thanks,
> >
>
> In summary, the security of the group dynamics are handled through
> block-DMA plus existing vfio_group_viable mechanism in this device-
> centric design. VFIO still keeps its group management, but no need
> to track the attaching status for allowing user access.

Still seems pretty loosely defined to me, the DMA blocking mechanism
isn't specified, there's no verification of the IOMMU context for
"stray" group devices, the group management is based in the IOASID
consumer code leading to varying degrees of implementation and
effectiveness between callers, we lean more heavily on a fragile
notifier to notice and hit the panic button on violation.

It would make a lot more sense to me if the model were for vfio to
bind groups to /dev/iommu, the IOASID code manages group integrity, and
devices can still be moved between IOASIDs as is the overall goal. The
group is the basis of ownership, which makes it a worthwhile part of
the API. Thanks,
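
A rough sketch of that model, with every name hypothetical and nothing
here reflecting an existing or proposed uAPI: ownership is established per
group on the iommu fd, and per-device attach is only permitted once the
device's group is bound.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_GROUPS 8

/* Hypothetical iommu fd state: which groups it owns. */
struct toy_iommu_fd {
    bool group_bound[TOY_MAX_GROUPS];
};

/* Hypothetical device: its group and current IOASID (-1 means blocked). */
struct toy_dev {
    int group_id;
    int ioasid;
};

/* Ownership is taken at group granularity. */
int toy_bind_group(struct toy_iommu_fd *fd, int group_id)
{
    assert(fd != NULL);
    if (group_id < 0 || group_id >= TOY_MAX_GROUPS)
        return -EINVAL;
    fd->group_bound[group_id] = true;
    return 0;
}

/* Devices may move between IOASIDs, but only inside an owned group. */
int toy_attach_dev(const struct toy_iommu_fd *fd, struct toy_dev *dev, int ioasid)
{
    assert(fd != NULL && dev != NULL);
    if (dev->group_id < 0 || dev->group_id >= TOY_MAX_GROUPS)
        return -EINVAL;
    if (!fd->group_bound[dev->group_id])
        return -EPERM;
    dev->ioasid = ioasid;
    return 0;
}
```

The group stays the unit of ownership while the IOASID attachment remains
per device.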

Alex

2021-06-17 03:46:10

by Liu Yi L

Subject: Re: Plan for /dev/ioasid RFC v2

Hi Alex,

On Wed, 16 Jun 2021 13:39:37 -0600, Alex Williamson wrote:

> On Wed, 16 Jun 2021 06:43:23 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Wednesday, June 16, 2021 12:12 AM
> > >
> > > On Tue, 15 Jun 2021 02:31:39 +0000
> > > "Tian, Kevin" <[email protected]> wrote:
> > >
> > > > > From: Alex Williamson <[email protected]>
> > > > > Sent: Tuesday, June 15, 2021 12:28 AM
> > > > >
> > > > [...]
> > > > > > IOASID. Today the group fd requires an IOASID before it hands out a
> > > > > > device_fd. With iommu_fd the device_fd will not allow IOCTLs until it
> > > > > > has a blocked DMA IOASID and is successfully joined to an iommu_fd.
> > > > >
> > > > > Which is the root of my concern. Who owns ioctls to the device fd?
> > > > > It's my understanding this is a vfio provided file descriptor and it's
> > > > > therefore vfio's responsibility. A device-level IOASID interface
> > > > > therefore requires that vfio manage the group aspect of device access.
> > > > > AFAICT, that means that device access can therefore only begin when all
> > > > > devices for a given group are attached to the IOASID and must halt for
> > > > > all devices in the group if any device is ever detached from an IOASID,
> > > > > even temporarily. That suggests a lot more oversight of the IOASIDs by
> > > > > vfio than I'd prefer.
> > > > >
> > > >
> > > > This is possibly the point that is worthy of more clarification and
> > > > alignment, as it sounds like the root of controversy here.
> > > >
> > > > I feel the goal of vfio group management is more about ownership, i.e.
> > > > all devices within a group must be assigned to a single user. Following
> > > > the three rules defined by Jason, what we really care is whether a group
> > > > of devices can be isolated from the rest of the world, i.e. no access to
> > > > memory/device outside of its security context and no access to its
> > > > security context from devices outside of this group. This can be achieved
> > > > as long as every device in the group is either in block-DMA state when
> > > > it's not attached to any security context or attached to an IOASID context
> > > > in IOMMU fd.
> > > >
> > > > As long as group-level isolation is satisfied, how devices within a group
> > > > are further managed is decided by the user (unattached, all attached to
> > > > same IOASID, attached to different IOASIDs) as long as the user
> > > > understands the implication of lacking isolation within the group. This
> > > > is where a device-centric model comes into play. Misconfiguration just hurts
> > > > the user itself.
> > > >
> > > > If this rationale can be agreed, then I didn't see the point of having VFIO
> > > > to mandate all devices in the group must be attached/detached in
> > > > lockstep.
> > >
> > > In theory this sounds great, but there are still too many assumptions
> > > and too much hand waving about where isolation occurs for me to feel
> > > like I really have the complete picture. So let's walk through some
> > > examples. Please fill in and correct where I'm wrong.
> >
> > Thanks for putting these examples. They are helpful for clearing the
> > whole picture.
> >
> > Before filling in let's first align on what is the key difference between
> > current VFIO model and this new proposal. With this comparison we'll
> > know which of following questions are answered with existing VFIO
> > mechanism and which are handled differently.
> >
> > With Yi's help we figured out the current mechanism:
> >
> > 1) vfio_group_viable. The code comment explains the intention clearly:
> >
> > --
> > * A vfio group is viable for use by userspace if all devices are in
> > * one of the following states:
> > * - driver-less
> > * - bound to a vfio driver
> > * - bound to an otherwise allowed driver
> > * - a PCI interconnect device
> > --
> >
> > Note this check is not related to an IOMMU security context.
>
> Because this is a pre-requisite for imposing that IOMMU security
> context.
>
> > 2) vfio_iommu_group_notifier. When an IOMMU_GROUP_NOTIFY_
> > BOUND_DRIVER event is notified, vfio_group_viable is re-evaluated.
> > If the affected group was previously viable but now becomes not
> > viable, BUG_ON() as it implies that this device is bound to a non-vfio
> > driver which breaks the group isolation.
>
> This notifier action is conditional on there being users of devices
> within a secure group IOMMU context.
>
> > 3) vfio_group_get_device_fd. User can acquire a device fd only after
> > a) the group is viable;
> > b) the group is attached to a container;
> > c) iommu is set on the container (implying a security context
> > established);
>
> The order is actually b) a) c) but arguably b) is a no-op until:
>
> d) a device fd is provided to the user

Per the code in QEMU's vfio_get_group(), the order is a) b) c): viability
is checked first, and the group is then attached to a container in
vfio_connect_container().

VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
{
...
    group = g_malloc0(sizeof(*group));

    snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
    group->fd = qemu_open_old(path, O_RDWR);
    if (group->fd < 0) {
        error_setg_errno(errp, errno, "failed to open %s", path);
        goto free_group_exit;
    }

    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
        error_setg_errno(errp, errno, "failed to get group %d status", groupid);
        goto close_fd_exit;
    }

    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
        error_setg(errp, "group %d is not viable", groupid);
        error_append_hint(errp,
                          "Please ensure all devices within the iommu_group "
                          "are bound to their vfio bus driver.\n");
        goto close_fd_exit;
    }

    group->groupid = groupid;
    QLIST_INIT(&group->device_list);

    if (vfio_connect_container(group, as, errp)) {
        error_prepend(errp, "failed to setup container for group %d: ",
                      groupid);
        goto close_fd_exit;
    }

...
}

--
Regards,
Yi Liu

2021-06-17 07:22:55

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, Jun 09, 2021 at 10:15:32AM -0600, Alex Williamson wrote:
> On Wed, 9 Jun 2021 17:51:26 +0200
> Joerg Roedel <[email protected]> wrote:
>
> > On Wed, Jun 09, 2021 at 12:00:09PM -0300, Jason Gunthorpe wrote:
> > > Only *drivers* know what the actual device is going to do, devices do
> > > not. Since the group doesn't have drivers it is the wrong layer to be
> > > making choices about how to configure the IOMMU.
> >
> > Groups don't carry how to configure IOMMUs, that information is
> > mostly in the IOMMU domains. And those (or an abstraction of them) is
> > configured through /dev/ioasid. So not sure what you wanted to say with
> > the above.
> >
> > All a group carries is information about which devices are not
> > sufficiently isolated from each other and thus need to always be in the
> > same domain.
> >
> > > The device centric approach is my attempt at this, and it is pretty
> > > clean, I think.
> >
> > Clean, but still insecure.
> >
> > > All ACS does is prevent P2P operations, if you assign all the group
> > > devices into the same /dev/iommu then you may not care about that
> > > security isolation property. At the very least it is policy for user
> > > to decide, not kernel.
> >
> > It is a kernel decision, because a fundamental task of the kernel is to
> > ensure isolation between user-space tasks as good as it can. And if a
> > device assigned to one task can interfere with a device of another task
> > (e.g. by sending P2P messages), then the promise of isolation is broken.
>
> AIUI, the IOASID model will still enforce IOMMU groups, but it's not an
> explicit part of the interface like it is for vfio. For example the
> IOASID model allows attaching individual devices such that we have
> granularity to create per device IOASIDs, but all devices within an
> IOMMU group are required to be attached to an IOASID before they can be
> used. It's not entirely clear to me yet how that last bit gets
> implemented though, ie. what barrier is in place to prevent device
> usage prior to reaching this viable state.
>
> > > Groups should be primarily about isolation security, not about IOASID
> > > matching.
> >
> > That doesn't make any sense, what do you mean by 'IOASID matching'?
>
> One of the problems with the vfio interface use of groups is that we
> conflate the IOMMU group for both isolation and granularity. I think
> what Jason is referring to here is that we still want groups to be the
> basis of isolation, but we don't want a uAPI that presumes all devices
> within the group must use the same IOASID. For example, if a user owns
> an IOMMU group consisting of non-isolated functions of a multi-function
> device, they should be able to create a vIOMMU VM where each of those
> functions has its own address space. That can't be done today, the
> entire group would need to be attached to the VM under a PCIe-to-PCI
> bridge to reflect the address space limitation imposed by the vfio
> group uAPI model. Thanks,

I'm fairly sceptical of the idea of allowing the "identifiable
requestor" grouping to be different from the isolation grouping.
Certainly it's possible in hardware, but I think it makes the
interface horribly complex to understand without buying much.

"Good" modern devices on modern systems will be both fully isolated
and well identified, so for the use cases that people seem to mostly
care about here we'll still have identification group == isolation
group == one device.

In other words, do we really have use cases where we need to identify
different device IDs, even though we know they're not isolated?

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-17 07:23:47

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Wed, Jun 09, 2021 at 09:39:19AM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 09, 2021 at 02:24:03PM +0200, Joerg Roedel wrote:
> > On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
> > > - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> > > convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> > > being device-centric (but it's fine for vfio to be group-centric). A new
> > > section will be added to elaborate this part;
> >
> > I would vote for group-centric here. Or do the reasons for which VFIO is
> > group-centric not apply to IOASID? If so, why?
>
> VFIO being group centric has made it very ugly/difficult to inject
> device driver specific knowledge into the scheme.
>
> The device driver is the only thing that knows to ask:
> - I need a SW table for this ioasid because I am like a mdev
> - I will issue TLPs with PASID
> - I need a IOASID linked to a PASID
> - I am a device that uses ENQCMD and vPASID
> - etc in future

mdev drivers might know these, but shim drivers, like basic vfio-pci
often won't. In that case only the userspace driver will know that
for certain. The shim driver at best has a fairly loose bound on what
the userspace driver *could* do.

I still think you're having a tendency to partially conflate several
meanings of "group":
1. the unavoidable hardware unit of non-isolation
2. the kernel internal concept and interface to it
3. the user visible fd and interface

We can't avoid having (1) somewhere, (3) and to a lesser extent (2)
are what you object to.

> The current approach has the group try to guess the device driver
> intention in the vfio type 1 code.

I agree this has gotten ugly. What I'm not yet convinced of is that
reworking groups to make this not-ugly necessarily requires totally
minimizing the importance of groups.

> I want to see this be clean and have the device driver directly tell
> the iommu layer what kind of DMA it plans to do, and thus how it needs
> the IOMMU and IOASID configured.

>
> This is the source of the ugly symbol_get and the very, very hacky 'if
> you are a mdev *and* a iommu then you must want a single PASID' stuff
> in type1.
>
> The group is causing all this mess because the group knows nothing
> about what the device drivers contained in the group actually want.
>
> Further being group centric eliminates the possibility of working in
> cases like !ACS. How do I use PASID functionality of a device behind a
> !ACS switch if the uAPI forces all IOASID's to be linked to a group,
> not a device?
>
> Device centric with a report that "all devices in the group must use
> the same IOASID" covers all the new functionality, keep the old, and
> has a better chance to keep going as a uAPI into the future.
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-17 07:23:50

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 10, 2021 at 01:50:22PM +0800, Lu Baolu wrote:
> On 6/9/21 8:39 PM, Jason Gunthorpe wrote:
> > On Wed, Jun 09, 2021 at 02:24:03PM +0200, Joerg Roedel wrote:
> > > On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
> > > > - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> > > > convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> > > > being device-centric (but it's fine for vfio to be group-centric). A new
> > > > section will be added to elaborate this part;
> > > I would vote for group-centric here. Or do the reasons for which VFIO is
> > > group-centric not apply to IOASID? If so, why?
> > VFIO being group centric has made it very ugly/difficult to inject
> > device driver specific knowledge into the scheme.
> >
> > The device driver is the only thing that knows to ask:
> > - I need a SW table for this ioasid because I am like a mdev
> > - I will issue TLPs with PASID
> > - I need a IOASID linked to a PASID
> > - I am a device that uses ENQCMD and vPASID
> > - etc in future
> >
> > The current approach has the group try to guess the device driver
> > intention in the vfio type 1 code.
> >
> > I want to see this be clean and have the device driver directly tell
> > the iommu layer what kind of DMA it plans to do, and thus how it needs
> > the IOMMU and IOASID configured.
> >
> > This is the source of the ugly symbol_get and the very, very hacky 'if
> > you are a mdev *and* a iommu then you must want a single PASID' stuff
> > in type1.
> >
> > The group is causing all this mess because the group knows nothing
> > about what the device drivers contained in the group actually want.
> >
> > Further being group centric eliminates the possibility of working in
> > cases like !ACS. How do I use PASID functionality of a device behind a
> > !ACS switch if the uAPI forces all IOASID's to be linked to a group,
> > not a device?
> >
> > Device centric with a report that "all devices in the group must use
> > the same IOASID" covers all the new functionality, keep the old, and
> > has a better chance to keep going as a uAPI into the future.
>
> The iommu_group can guarantee the isolation among different physical
> devices (represented by RIDs). But when it comes to sub-devices (ex. mdev or
> vDPA devices represented by RID + SSID), we have to rely on the
> device driver for isolation. The devices which are able to generate sub-
> devices should either use their own on-device mechanisms or use the
> platform features like Intel Scalable IOV to isolate the sub-devices.

This seems like a misunderstanding of groups. Groups are not tied to
any PCI meaning. Groups are the smallest unit of isolation, no matter
what is providing that isolation.

If mdevs are isolated from each other by clever software, even though
they're on the same PCI device they are in different groups from each
other *by definition*. They are also in a different group from their
parent device (however the mdevs only exist when mdev driver is
active, which implies that the parent device's group is owned by the
kernel).

> Under above conditions, different sub-device from a same RID device
> could be able to use different IOASID. This seems to mean that we can't
> support mixed mode where, for example, two RIDs share an iommu_group and
> one (or both) of them have sub-devices.

That doesn't necessarily follow. mdevs which can be successfully
isolated by their mdev driver are in a different group from their
parent device, and therefore need not be affected by whether the
parent device shares a group with some other physical device. They
*might* be, but that's up to the mdev driver to determine based on
what it can safely isolate.

> AIUI, when we attach a "RID + SSID" to an IOASID, we should require that
> the RID doesn't share the iommu_group with any other RID.
>
> Best regards,
> baolu
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-17 07:24:50

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 11, 2021 at 01:45:29PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 09:38:42AM -0600, Alex Williamson wrote:
>
> > Opening the group is not the extent of the security check currently
> > required, the group must be added to a container and an IOMMU model
> > configured for the container *before* the user can get a devicefd.
> > Each devicefd creates a reference to this security context, therefore
> > access to a device does not exist without such a context.
>
> Okay, I missed that detail in the organization..
>
> So, if we have an independent vfio device fd then it needs to be
> kept disabled until the user joins it to an ioasid that provides the
> security proof to allow it to work?
>
> > What happens on detach? As we've discussed elsewhere in this thread,
> > revoking access is more difficult than holding a reference to the
> > secure context, but I'm under the impression that moving a device
> > between IOASIDs could be standard practice in this new model. A device
> > that's detached from a secure context, even temporarily, is a
> > problem.
>
> This is why I think the single iommu FD is critical, it is the FD, not
> the IOASID that has to authorize the security. You shouldn't move
> devices between FDs, but you can move them between IOASIDs inside the
> same FD.
>
> > How to label a device seems like a relatively mundane issue relative to
> > ownership and isolated contexts of groups and devices. The label is
> > essentially just creating an identifier to device mapping, where the
> > identifier (label) will be used in the IOASID interface, right?
>
> It looks that way
>
> > As I note above, that makes it difficult for vfio to maintain that a
> > user only accesses a device in a secure context. This is exactly
> > why vfio has the model of getting a devicefd from a groupfd only
> > when that group is in a secure context and maintaining references to
> > that secure context for each device. Split ownership of the secure
> > context in IOASID vs device access in vfio and exposing devicefds
> > outside the group is still a big question mark for me. Thanks,
>
> I think the protection model becomes different once we allow
> individual devices inside a group to be attached to different
> IOASID's.

I'm really wary of this. They might be rare, but we still need to
consider the case of devices which can't be distinguished on the bus,
and therefore can't be attached to different IOASIDs. That means that
if we allow attaching devices within a group to different IOASIDs we
effectively need to introduce two levels of "group-like" things.
First the identification group, then the isolation group.


You're using "group" for the isolation group, but then we have to
somehow expose this concept of identification group. That seems like
a heap of complexity and confusion in the interface.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-17 07:32:41

by Tian, Kevin

[permalink] [raw]
Subject: RE: Plan for /dev/ioasid RFC v2

> From: Alex Williamson <[email protected]>
> Sent: Thursday, June 17, 2021 3:40 AM
>
> On Wed, 16 Jun 2021 06:43:23 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Wednesday, June 16, 2021 12:12 AM
> > >
> > > On Tue, 15 Jun 2021 02:31:39 +0000
> > > "Tian, Kevin" <[email protected]> wrote:
> > >
> > > > > From: Alex Williamson <[email protected]>
> > > > > Sent: Tuesday, June 15, 2021 12:28 AM
> > > > >
> > > > [...]
> > > > > > IOASID. Today the group fd requires an IOASID before it hands out a
> > > > > > device_fd. With iommu_fd the device_fd will not allow IOCTLs until
> it
> > > > > > > has a blocked DMA IOASID and is successfully joined to an
> iommu_fd.
> > > > >
> > > > > Which is the root of my concern. Who owns ioctls to the device fd?
> > > > > It's my understanding this is a vfio provided file descriptor and it's
> > > > > therefore vfio's responsibility. A device-level IOASID interface
> > > > > therefore requires that vfio manage the group aspect of device access.
> > > > > AFAICT, that means that device access can therefore only begin when
> all
> > > > > devices for a given group are attached to the IOASID and must halt for
> > > > > all devices in the group if any device is ever detached from an IOASID,
> > > > > even temporarily. That suggests a lot more oversight of the IOASIDs
> by
> > > > > vfio than I'd prefer.
> > > > >
> > > >
> > > > This is possibly the point that is worthy of more clarification and
> > > > alignment, as it sounds like the root of controversy here.
> > > >
> > > > I feel the goal of vfio group management is more about ownership, i.e.
> > > > all devices within a group must be assigned to a single user. Following
> > > > the three rules defined by Jason, what we really care is whether a group
> > > > of devices can be isolated from the rest of the world, i.e. no access to
> > > > memory/device outside of its security context and no access to its
> > > > security context from devices outside of this group. This can be
> achieved
> > > > as long as every device in the group is either in block-DMA state when
> > > > it's not attached to any security context or attached to an IOASID
> context
> > > > in IOMMU fd.
> > > >
> > > > As long as group-level isolation is satisfied, how devices within a group
> > > > are further managed is decided by the user (unattached, all attached to
> > > > same IOASID, attached to different IOASIDs) as long as the user
> > > > understands the implication of lacking of isolation within the group.
> This
> > > > is what a device-centric model comes to play. Misconfiguration just
> hurts
> > > > the user itself.
> > > >
> > > > If this rationale can be agreed, then I didn't see the point of having VFIO
> > > > to mandate all devices in the group must be attached/detached in
> > > > lockstep.
> > >
> > > In theory this sounds great, but there are still too many assumptions
> > > and too much hand waving about where isolation occurs for me to feel
> > > like I really have the complete picture. So let's walk through some
> > > examples. Please fill in and correct where I'm wrong.
> >
> > Thanks for putting these examples. They are helpful for clearing the
> > whole picture.
> >
> > Before filling in let's first align on what is the key difference between
> > current VFIO model and this new proposal. With this comparison we'll
> > know which of following questions are answered with existing VFIO
> > mechanism and which are handled differently.
> >
> > With Yi's help we figured out the current mechanism:
> >
> > 1) vfio_group_viable. The code comment explains the intention clearly:
> >
> > --
> > * A vfio group is viable for use by userspace if all devices are in
> > * one of the following states:
> > * - driver-less
> > * - bound to a vfio driver
> > * - bound to an otherwise allowed driver
> > * - a PCI interconnect device
> > --
> >
> > Note this check is not related to an IOMMU security context.
>
> Because this is a pre-requisite for imposing that IOMMU security
> context.
>
> > 2) vfio_iommu_group_notifier. When an IOMMU_GROUP_NOTIFY_
> > BOUND_DRIVER event is notified, vfio_group_viable is re-evaluated.
> > If the affected group was previously viable but now becomes not
> > viable, BUG_ON() as it implies that this device is bound to a non-vfio
> > driver which breaks the group isolation.
>
> This notifier action is conditional on there being users of devices
> within a secure group IOMMU context.
>
> > 3) vfio_group_get_device_fd. User can acquire a device fd only after
> > a) the group is viable;
> > b) the group is attached to a container;
> > c) iommu is set on the container (implying a security context
> > established);
>
> The order is actually b) a) c) but arguably b) is a no-op until:
>
> d) a device fd is provided to the user
>
> > The new device-centric proposal suggests:
> >
> > 1) vfio_group_viable;
> > 2) vfio_iommu_group_notifier;
> > 3) block-DMA if a device is detached from previous domain (instead of
> > switching back to default domain as today);
>
> I'm literally begging for specifics in this thread, but none are
> provided here. What is the "previous domain"? How is a device placed
> into a DMA blocking IOMMU context? Is this the IOMMU default domain?
> Doesn't that represent a change in IOMMU behavior to place devices into
> a blocking DMA context in several of the group-viable scenarios?

Yes, it represents a change in current IOMMU behavior. Here I just
described what would be the desired logic in concept.

More specifically, the current IOMMU behavior is that:

- A device is attached to the default domain (identity or dma) when it's
probed by the iommu driver. If the domain type is dma, iommu
isolation is enabled with an empty I/O page table thus the device
DMA is blocked. If the domain type is identity, iommu isolation is
disabled thus the device can access arbitrary memory/device even
when it's not bound to any driver.

- Once the device is bound to a driver which doesn't allocate a new
domain, the default domain allows the driver to use the DMA API on
the device. Unbinding from the driver doesn't change the device/domain
attachment status, i.e. the device is still attached to the default domain.
Whether the device can access certain memory locations after unbind
depends on whether the driver cleans up its mappings properly.

- Now the device is bound to a driver (vfio) which manages its own
security context (domain type is unmanaged). The device stays
attached to the default domain until the driver explicitly switches
it to the new unmanaged domain. Detaching the device from an
unmanaged domain later puts it back on the default domain.

Then the current vfio mechanism makes sense because there is no
guarantee that the default domain can isolate the device from the
rest of the system (e.g. if the domain type is identity, or if it is dma
but a previous driver left stale mappings due to some bug). vfio has to
allow user access only after all devices in the group are switched to a
known security context that is created by vfio itself.
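
The current behavior described above can be sketched as a small model.
This is only an illustration of the reasoning, not kernel code: the enum,
the function names and the stale_mappings flag are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model types for illustration; not kernel definitions. */
enum default_domain_type { DEFAULT_IDENTITY, DEFAULT_DMA };

/*
 * Can the default domain alone be trusted to isolate a device?
 *  - identity: iommu isolation is disabled, so never.
 *  - dma: only if no previous driver left stale mappings behind.
 */
static bool default_domain_isolates(enum default_domain_type type,
                                    bool stale_mappings)
{
    if (type == DEFAULT_IDENTITY)
        return false;
    return !stale_mappings;
}

/*
 * Today's vfio rule as described above: user access is allowed only
 * once every device in the group has been switched to the security
 * context (unmanaged domain) created by vfio itself.
 */
static bool vfio_today_allows_access(const bool *in_vfio_domain, int ndev)
{
    for (int i = 0; i < ndev; i++)
        if (!in_vfio_domain[i])
            return false;
    return true;
}
```

Since default_domain_isolates() can return false in two of the three
cases, vfio cannot rely on the default domain and has to wait for the
whole group.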

Now let's talk about the new IOMMU behavior:

- A device is blocked from doing DMA to any resource outside of
its group when it's probed by the IOMMU driver. This could be a
special state w/o attaching to any domain, or a new special domain
type which differentiates it from existing domain types (identity,
dma, or unmanaged). Actually existing code already includes an
IOMMU_DOMAIN_BLOCKED type but nobody uses it.

- Once the device is bound to a driver which doesn't allocate a new
domain, the first DMA API call implicitly switches the device from
block-DMA state to use the existing default domain (identity or
dma). This change should be easy as current code already supports
a deferred attach mode which is activated in the kdump kernel. Unbinding
from the driver implicitly detaches the device from the default
domain and switches it back to the block-DMA state. This can be
enforced via iommu_bus_notifier().

- Now the device is bound to a driver (vfio) which delegates management
of security context to iommu fd. The device stays in the block-DMA
state before it's attached to an IOASID. After IOASID attaching, it is put
in a new security context represented by the IOASID. Detaching the
device from an IOASID puts it back to block-DMA.

With this new behavior vfio just needs to track that all devices are in
block-DMA state before user access is allowed. This can be reported
via a new iommu interface and checked in vfio_group_viable() in
addition to what it verifies today. Once a group is viable, the user can
get a device fd from the group and bind it to iommu fd. vfio doesn't
need to wait for all devices in the group to be attached to the same IOASID
before granting user access, because they are all isolated from the
rest of the system, being either in block-DMA or in a new security context.
Thus a device-centric interface between vfio and iommu fd should
be sufficient.
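
To make the proposed behavior concrete, here is a minimal sketch of the
per-device state transitions and the two group-level checks. All names
are hypothetical and exist only for this illustration. It also assumes
that a driver unbind only affects a device currently using the default
domain, which the description above does not spell out.

```c
#include <stdbool.h>

/* Hypothetical per-device DMA states in the proposed model. */
enum dma_state { BLOCK_DMA, IN_DEFAULT_DOMAIN, IN_IOASID };

/* Hypothetical events that move a device between states. */
enum dma_event {
    EV_FIRST_DMA_API_CALL, /* kernel driver starts using the DMA API */
    EV_DRIVER_UNBIND,      /* kernel driver is unbound */
    EV_IOASID_ATTACH,      /* user attaches the device to an IOASID */
    EV_IOASID_DETACH       /* user detaches the device from an IOASID */
};

static enum dma_state next_state(enum dma_state s, enum dma_event ev)
{
    switch (ev) {
    case EV_FIRST_DMA_API_CALL:
        return s == BLOCK_DMA ? IN_DEFAULT_DOMAIN : s;
    case EV_DRIVER_UNBIND:
        /* Assumption: only matters when the default domain is in use. */
        return s == IN_DEFAULT_DOMAIN ? BLOCK_DMA : s;
    case EV_IOASID_ATTACH:
        return s == BLOCK_DMA ? IN_IOASID : s;
    case EV_IOASID_DETACH:
        return s == IN_IOASID ? BLOCK_DMA : s;
    }
    return s;
}

/* Extra condition for vfio_group_viable(): every device is blocked. */
static bool all_block_dma(const enum dma_state *devs, int ndev)
{
    for (int i = 0; i < ndev; i++)
        if (devs[i] != BLOCK_DMA)
            return false;
    return true;
}

/* After user access starts: blocked or in an IOASID, never default. */
static bool group_isolated(const enum dma_state *devs, int ndev)
{
    for (int i = 0; i < ndev; i++)
        if (devs[i] == IN_DEFAULT_DOMAIN)
            return false;
    return true;
}
```

In this model all_block_dma() gates handing out the device fd, while
group_isolated() stays true no matter how the user attaches or detaches
individual devices afterwards.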

vfio_iommu_group_notifier will be slightly changed to check whether
any device within the group is bound to iommu fd. If yes BUG_ON
is raised to avoid breaking the group isolation. To be consistent
VFIO_BIND_IOMMU_FD also needs to check group viability.
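
A rough sketch of that notifier decision, under my reading that the
BUG_ON fires only when the group has become non-viable while some device
in it is bound to iommu fd. The struct and function names are
hypothetical, and the real check would live in the vfio notifier
callback.

```c
#include <stdbool.h>

/* Hypothetical per-device view used only for this illustration. */
struct dev_model {
    bool driver_allowed;    /* driver-less, vfio, allowed driver or bridge */
    bool bound_to_iommu_fd; /* user has bound this device to an iommu fd */
};

/* Mirrors the viability rule: every device is in an allowed state. */
static bool model_group_viable(const struct dev_model *devs, int ndev)
{
    for (int i = 0; i < ndev; i++)
        if (!devs[i].driver_allowed)
            return false;
    return true;
}

/*
 * Decision taken on a BOUND_DRIVER event: fatal (BUG_ON in the kernel)
 * if the group is no longer viable and any of its devices is bound to
 * iommu fd, since group isolation would then be broken.
 */
static bool bound_driver_event_is_fatal(const struct dev_model *devs,
                                        int ndev)
{
    if (model_group_viable(devs, ndev))
        return false;
    for (int i = 0; i < ndev; i++)
        if (devs[i].bound_to_iommu_fd)
            return true;
    return false;
}
```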

>
> > 4) vfio_group_get_device_fd. User can acquire a device fd once the group
> > is viable;
>
> But as you've noted, "viable" doesn't test the IOMMU context of the
> group devices, it's only a pre-condition for attaching the group to an
> IOMMU context for isolated access. What changes in the kernel that
> makes "viable" become "isolated"? A device bound to pci-stub today is
> certainly not in a DMA blocking context when the host is booted with
> iommu=pt. Enabling the IOMMU only for device assignment by using
> iommu=pt is arguably the predominant use case of the IOMMU.

vfio_group_viable() needs to check block-DMA, and iommu=pt
only affects the DMA API path now. A device which is not bound
to any driver, or whose driver doesn't do DMA on it, is left in the
block-DMA state.

>
> > 5) device-centric when binding to IOMMU fd or attaching to IOASID
> >
> > In this model the group viability mechanism is kept but there is no need
> > for VFIO to track the actual attaching status.
> >
> > Now let's look at how the new model works.
> >
> > >
> > > 1) A dual-function PCIe e1000e NIC where the functions are grouped
> > > together due to ACS isolation issues.
> > >
> > > a) Initial state: functions 0 & 1 are both bound to e1000e driver.
> > >
> > > b) Admin uses driverctl to bind function 1 to vfio-pci, creating
> > > vfio device file, which is chmod'd to grant to a user.
> >
> > This implies that function 1 is in block-DMA mode when it's unbound
> > from e1000e.
>
> Does this require a kernel change from current? Does it require the
> host is not in iommu=pt mode? Did vfio or vfio-pci do anything to
> impose this DMA blocking context? What if function 1 is actually a DMA

I hope above explanation answers them.

> alias of function 0, wouldn't changing function 1's IOMMU context break
> the operation of function 0?

Sorry I'm not familiar with this DMA aliasing thing. Can you elaborate?

>
> > >
> > > c) User opens vfio function 1 device file and an iommu_fd, binds
> > > device_fd to iommu_fd.
> >
> > User should check group viability before step c).
>
> Sure, but "user should" is not a viable security model.
>
> > >
> > > Does this succeed?
> > > - if no, specifically where does it fail?
> > > - if yes, vfio can now allow access to the device?
> > >
> >
> > with group viability step c) fails.
>
> I'm asking for specifics, is it vfio's responsibility to test viability
> before trying to bind the device_fd to the iommu_fd and it's vfio that
> triggers this failure? This sounds like vfio is entirely responsible
> for managing the integrity of the group.

vfio manages the integrity of the group based on block-DMA, but there is
no need for a group interface with iommu fd to track group attaching status.

>
> > > d) Repeat b) for function 0.
> >
> > function 0 is in block DMA mode now.
>
> Somehow...
>
> > >
> > > e) Repeat c), still using function 1, is it different? Where? Why?
> >
> > it's different because group becomes viable now. Then step c) succeeds.
> > At this point, both function 0/1 are in block-DMA mode thus isolated
> > from the rest of the system. VFIO allows the user to access function 1
> > without the need of knowing when function 1 is attached to a new
> > context (IOASID) via IOMMU fd and whether function 0 is left detached.
> >
> > >
> > > 2) The same NIC as 1)
> > >
> > > a) Initial state: functions 0 & 1 bound to vfio-pci, vfio device
> > > files granted to user, user has bound both device_fds to the same
> > > iommu_fd.
> > >
> > > AIUI, even though not bound to an IOASID, vfio can now enable access
> > > through the device_fds, right? What specific entity has placed these
> >
> > yes
> >
> > > devices into a block DMA state, when, and how?
> >
> > As explained in 2.b), both devices are put into block-DMA when they
> > are detached from the default domain which is used when they are
> > bound to e1000e driver.
>
> How do stub drivers interact with this model? How do PCI interconnect
> drivers work with this model? How do DMA alias devices work with this
> model? How does iommu=pt work with this model? Does vfio just
> passively assume the DMA blocking IOMMU context based on other random
> attributes of the device?

pci-stub and bridge drivers follow the existing viability check.

iommu=pt has no impact on block-DMA.

vfio explicitly tracks the dma-blocking state, which doesn't rely on iommu fd.

but I haven't got time to think about DMA aliasing yet.

>
> > >
> > > b) Both devices are attached to the same IOASID.
> > >
> > > Are we assuming that each device was atomically moved to the new
> > > IOMMU context by the IOASID code? What if the IOMMU cannot
> change
> > > the domain atomically?
> >
> > No. Moving function 0 then function 1, or moving function 0 alone can
> > all works. The one which hasn't been attached to an IOASID is kept in
> > block-DMA state.
>
> I'm asking whether this can be accomplished atomically relative to
> device DMA. If the user has access to the device after the bind
> operation and the device operates in a DMA blocking IOMMU context at
> that point, it seems that every IOASID context switch must be atomic
> relative to device DMA or we present an exploitable gap to the user.

the switch is always between block-DMA and a driver-created domain
which are both secure. Does this assumption meet the 'atomic' behavior
in your mind?

>
> This is another change from vfio, the lifetime of the IOMMU context
> encompasses the lifetime of device access.
>
> > >
> > > c) The device_fd for function 1 is detached from the IOASID.
> > >
> > > Are we assuming the reverse of b) performed by the IOASID code?
> >
> > function 1 turns back to block-DMA
> >
> > >
> > > d) The device_fd for function 1 is unbound from the iommu_fd.
> > >
> > > Does this succeed?
> > > - if yes, what is the resulting IOMMU context of the device and
> > > who owns it?
> > > - if no, well, that results in numerous tear-down issues.
> >
> > Yes. function 1 is block-DMA while function 0 still attached to IOASID.
> > Actually unbind from IOMMU fd doesn't change the security context.
> > the change is conducted when attaching/detaching device to/from an
> > IOASID.
>
> But I think you're suggesting that the IOMMU context is simply the
> device's default domain, so vfio is left in the position where the user
> gained access to the device by binding it to an iommu_fd, but now the
> device exists outside of the iommu_fd. Doesn't that make it pointless
> to gate device access on binding the device to the iommu_fd? The user
> can get an accessible device_fd unbound from an iommu_fd on the reverse
> path.

yes, binding to iommu_fd is not the appropriate point of gating
device access.

>
> That would mean vfio's only control point for device access is on
> open().

yes, on open() via block-DMA check in vfio_group_viable().

>
> > >
> > > e) Function 1 is unbound from vfio-pci.
> > >
> > > Does this work or is it blocked? If blocked, by what entity
> > > specifically?
> >
> > works.
> >
> > >
> > > f) Function 1 is bound to e1000e driver.
> > >
> > > We clearly have a violation here, specifically where and by who in
> > > this path should have prevented us from getting here or who pushes
> > > the BUG_ON to abort this?
> >
> > via vfio_iommu_group_notifier, same as today.
>
> So as above, group integrity remains entirely vfio's issue? Didn't we

sort of...

> discuss elsewhere in this thread that unless group integrity is managed
> by /dev/iommu that we're going to have a mess of different consumers
> managing it different degrees and effectiveness (or more likely just
> ignoring it)?

Yes, that was the original impression. But after figuring out the new
block-DMA behavior, I'm not sure whether /dev/iommu must maintain
its own group integrity check. If it trusts vfio, I feel it's fine to avoid
such check which even allows a group of devices bound to different
IOMMU fd's if user likes. Also if we want to sustain the current vfio
semantics which doesn't require all devices in the group bound to
vfio driver, seems it's pointless to enforce such integrity check in
/dev/iommu.

Jason, what's your opinion?

>
> > >
> > > 3) A dual-function conventional PCI e1000 NIC where the functions are
> > > grouped together due to shared RID.
> > >
> > > a) Repeat 2.a) and 2.b) such that we have a valid, user accessible
> > > devices in the same IOMMU context.
> > >
> > > b) Function 1 is detached from the IOASID.
> > >
> > > I think function 1 cannot be placed into a different IOMMU context
> > > here, does the detach work? What's the IOMMU context now?
> >
> > Yes. Function 1 is back to block-DMA. Since both functions share RID,
> > essentially it implies function 0 is in block-DMA state too (though its
> > tracking state may not change yet) since the shared IOMMU context
> > entry blocks DMA now. In IOMMU fd function 0 is still attached to the
> > IOASID thus the user still needs do an explicit detach to clear the
> > tracking state for function 0.
> >
> > >
> > > c) A new IOASID is alloc'd within the existing iommu_fd and function
> > > 1 is attached to the new IOASID.
> > >
> > > Where, how, by whom does this fail?
> >
> > No need to fail. It can succeed since doing so just hurts user's own foot.
> >
> > The only question is how user knows the fact that a group of devices
> > share RID thus avoid such thing. I'm curious how it is communicated
> > with today's VFIO mechanism. Yes the group-centric VFIO uAPI prevents
> > a group of devices from attaching to multiple IOMMU contexts, but
> > suppose we still need a way to tell the user to not do so. Especially
> > such knowledge would be also reflected in the virtual PCI topology
> > when the entire group is assigned to the guest which needs to know
> > this fact when vIOMMU is exposed. I haven't found time to investigate
> > it but suppose if such channel exists it could be reused, or in the worst
> > case we may have the new device capability interface to convey...
>
> No such channel currently exists, it's not an issue today, IOMMU
> context is group-based.

Interesting... If such group of devices are assigned to a guest, how does
Qemu decide the virtual PCI topology for them? Do they have same
vRID or different?

>
> > > If vfio gets to offload all of it's group management to IOASID code,
> > > that's great, but I'm afraid that IOASID is so focused on a
> > > device-level API that we're instead just ignoring the group dynamics
> > > and vfio will be forced to provide oversight to maintain secure
> > > userspace access. Thanks,
> > >
> >
> > In summary, the security of the group dynamics are handled through
> > block-DMA plus existing vfio_group_viable mechanism in this device-
> > centric design. VFIO still keeps its group management, but no need
> > to track the attaching status for allowing user access.
>
> Still seems pretty loosely defined to me, the DMA blocking mechanism

Sorry for that and hope the explanation in this mail makes it clearer.

> isn't specified, there's no verification of the IOMMU context for
> "stray" group devices, the group management is based in the IOASID
> consumer code leading to varying degrees of implementation and
> effectiveness between callers, we lean more heavily on a fragile
> notifier to notice and hit the panic button on violation.
>
> It would make a lot more sense to me if the model were for vfio to
> bind groups to /dev/iommu, the IOASID code manages group integrity, and
> devices can still be moved between IOASIDs as is the overall goal. The
> group is the basis of ownership, which makes it a worthwhile part of
> the API. Thanks,
>

Having explained the device-centric design, honestly speaking I think
your model could also work. Group is an iommu concept, thus it is not
unsound to ask /dev/iommu to manage the group integrity, e.g.
moving the block-DMA verification and vfio_group_viable() into
/dev/iommu and verifying them when doing group binding. A successful
group binding implies all group verification passed, so user access
can then be allowed. But even so I don't expect the /dev/iommu uAPI
to include any explicit group semantics. Only the in-kernel helper
functions would accept a group, via VFIO_GROUP_BIND_IOMMU_FD.

Thanks
Kevin

2021-06-17 21:35:54

by Alex Williamson

[permalink] [raw]
Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, 17 Jun 2021 07:31:03 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Thursday, June 17, 2021 3:40 AM
> >
> > On Wed, 16 Jun 2021 06:43:23 +0000
> > "Tian, Kevin" <[email protected]> wrote:
> >
> > > > From: Alex Williamson <[email protected]>
> > > > Sent: Wednesday, June 16, 2021 12:12 AM
> > > >
> > > > On Tue, 15 Jun 2021 02:31:39 +0000
> > > > "Tian, Kevin" <[email protected]> wrote:
> > > >
> > > > > > From: Alex Williamson <[email protected]>
> > > > > > Sent: Tuesday, June 15, 2021 12:28 AM
> > > > > >
> > > > > [...]
> > > > > > > IOASID. Today the group fd requires an IOASID before it hands out a
> > > > > > > device_fd. With iommu_fd the device_fd will not allow IOCTLs until
> > it
> > > > > > > > has a blocked DMA IOASID and is successfully joined to an
> > iommu_fd.
> > > > > >
> > > > > > Which is the root of my concern. Who owns ioctls to the device fd?
> > > > > > It's my understanding this is a vfio provided file descriptor and it's
> > > > > > therefore vfio's responsibility. A device-level IOASID interface
> > > > > > therefore requires that vfio manage the group aspect of device access.
> > > > > > AFAICT, that means that device access can therefore only begin when
> > all
> > > > > > devices for a given group are attached to the IOASID and must halt for
> > > > > > all devices in the group if any device is ever detached from an IOASID,
> > > > > > even temporarily. That suggests a lot more oversight of the IOASIDs
> > by
> > > > > > vfio than I'd prefer.
> > > > > >
> > > > >
> > > > > This is possibly the point that is worthy of more clarification and
> > > > > alignment, as it sounds like the root of controversy here.
> > > > >
> > > > > I feel the goal of vfio group management is more about ownership, i.e.
> > > > > all devices within a group must be assigned to a single user. Following
> > > > > the three rules defined by Jason, what we really care is whether a group
> > > > > of devices can be isolated from the rest of the world, i.e. no access to
> > > > > memory/device outside of its security context and no access to its
> > > > > security context from devices outside of this group. This can be
> > achieved
> > > > > as long as every device in the group is either in block-DMA state when
> > > > > it's not attached to any security context or attached to an IOASID
> > context
> > > > > in IOMMU fd.
> > > > >
> > > > > As long as group-level isolation is satisfied, how devices within a group
> > > > > are further managed is decided by the user (unattached, all attached to
> > > > > same IOASID, attached to different IOASIDs) as long as the user
> > > > > understands the implication of lacking of isolation within the group.
> > This
> > > > > is what a device-centric model comes to play. Misconfiguration just
> > hurts
> > > > > the user itself.
> > > > >
> > > > > If this rationale can be agreed, then I didn't see the point of having VFIO
> > > > > to mandate all devices in the group must be attached/detached in
> > > > > lockstep.
> > > >
> > > > In theory this sounds great, but there are still too many assumptions
> > > > and too much hand waving about where isolation occurs for me to feel
> > > > like I really have the complete picture. So let's walk through some
> > > > examples. Please fill in and correct where I'm wrong.
> > >
> > > Thanks for putting these examples. They are helpful for clearing the
> > > whole picture.
> > >
> > > Before filling in let's first align on what is the key difference between
> > > current VFIO model and this new proposal. With this comparison we'll
> > > know which of following questions are answered with existing VFIO
> > > mechanism and which are handled differently.
> > >
> > > With Yi's help we figured out the current mechanism:
> > >
> > > 1) vfio_group_viable. The code comment explains the intention clearly:
> > >
> > > --
> > > * A vfio group is viable for use by userspace if all devices are in
> > > * one of the following states:
> > > * - driver-less
> > > * - bound to a vfio driver
> > > * - bound to an otherwise allowed driver
> > > * - a PCI interconnect device
> > > --
> > >
> > > Note this check is not related to an IOMMU security context.
> >
> > Because this is a pre-requisite for imposing that IOMMU security
> > context.
> >
> > > 2) vfio_iommu_group_notifier. When an IOMMU_GROUP_NOTIFY_
> > > BOUND_DRIVER event is notified, vfio_group_viable is re-evaluated.
> > > If the affected group was previously viable but now becomes not
> > > viable, BUG_ON() as it implies that this device is bound to a non-vfio
> > > driver which breaks the group isolation.
> >
> > This notifier action is conditional on there being users of devices
> > within a secure group IOMMU context.
> >
> > > 3) vfio_group_get_device_fd. User can acquire a device fd only after
> > > a) the group is viable;
> > > b) the group is attached to a container;
> > > c) iommu is set on the container (implying a security context
> > > established);
> >
> > The order is actually b) a) c) but arguably b) is a no-op until:
> >
> > d) a device fd is provided to the user
> >
> > > The new device-centric proposal suggests:
> > >
> > > 1) vfio_group_viable;
> > > 2) vfio_iommu_group_notifier;
> > > 3) block-DMA if a device is detached from previous domain (instead of
> > > switching back to default domain as today);
> >
> > I'm literally begging for specifics in this thread, but none are
> > provided here. What is the "previous domain"? How is a device placed
> > into a DMA blocking IOMMU context? Is this the IOMMU default domain?
> > Doesn't that represent a change in IOMMU behavior to place devices into
> > a blocking DMA context in several of the group-viable scenarios?
>
> Yes, it represents a change in current IOMMU behavior. Here I just
> described what would be the desired logic in concept.
>
> More specifically, the current IOMMU behavior is that:
>
> - A device is attached to the default domain (identity or dma) when it's
> probed by the iommu driver. If the domain type is dma, iommu
> isolation is enabled with an empty I/O page table thus the device
> DMA is blocked. If the domain type is identity, iommu isolation is
> disabled thus the device can access arbitrary memory/device even
> when it's not bound to any driver.
>
> - Once the device is bound to a driver which doesn't allocate a new
> domain, the default domain allows the driver to do DMA API on
> the device. Unbound from the driver doesn't change device/domain
> attaching status i.e. the device is still attached to the default domain.
> Whether the device can access certain memory locations after unbind
> depends on whether the driver clears up its mappings properly.
>
> - Now the device is bound to a driver (vfio) which manages its own
> security context (domain type is unmanaged). The device stays
> attaching to the default domain before the driver explicitly switches
> it to use the new unmanaged domain. Detaching the device from an
> unmanaged domain later puts it back to use the default domain.
>
> Then the current vfio mechanism makes sense because there is no
> guarantee that the default domain can isolate the device from the
> rest system (if domain type is identity or dma but previous driver
> leaves stale mappings due to some bug). vfio has to allow user access
> only after all devices in the group are switched to a known security
> context that is created by vfio itself.
>
> Now let's talk about the new IOMMU behavior:
>
> - A device is blocked from doing DMA to any resource outside of
> its group when it's probed by the IOMMU driver. This could be a
> special state w/o attaching to any domain, or a new special domain
> type which differentiates it from existing domain types (identity,
> dma, or unmanaged). Actually existing code already includes an
> IOMMU_DOMAIN_BLOCKED type but nobody uses it.
>
> - Once the device is bound to a driver which doesn't allocate a new
> domain, the first DMA API call implicitly switches the device from
> block-DMA state to use the existing default domain (identity or
> dma). This change should be easy as current code already supports
> a deferred attach mode which is activated in kdump kernel. Unbound
> from the driver implicitly detaches the device from the default
> domain and switches it back to the block-DMA state. This can be
> enforced via iommu_bus_notifier().
>
> - Now the device is bound to a driver (vfio) which delegates management
> of security context to iommu fd. The device stays in the block-DMA
> state before it's attached to an IOASID. After IOASID attaching, it is put
> in a new security context represented by the IOASID. Detaching the
> device from an IOASID puts it back to block-DMA.
>
> With this new behavior vfio just needs to track that all devices are in
> block-DMA state before user access is allowed. This can be reported
> via a new iommu interface and checked in vfio_group_viable() in
> addition to what it verifies today. Once a group is viable, the user can
> get a device fd from the group and bind it to iommu fd. vfio doesn't
> need to wait for all devices in the group to be attached to the same IOASID
> before granting user access, because they are all isolated from the
> rest of the system, being either in block-DMA or in a new security context.
> Thus a device-centric interface between vfio and iommu fd should
> be sufficient.

Thanks for the additional detail.
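
The three transitions described above can be modelled as a small state
machine. The following is a toy user-space sketch in Python, not kernel
code; the class, method and state names are made up for illustration, and
the viability rule (no group member left in the default domain) is one
reading of the proposal rather than a settled design.

```python
from enum import Enum, auto


class DmaState(Enum):
    BLOCKED = auto()  # no DMA reaches anything outside the group
    DEFAULT = auto()  # kernel default domain (identity or dma)
    IOASID = auto()   # user-managed context created through the iommu fd


class Device:
    def __init__(self, name):
        self.name = name
        self.state = DmaState.BLOCKED  # state right after IOMMU probe

    def first_dma_api_call(self):
        # Kernel driver path: deferred attach to the default domain.
        if self.state is DmaState.BLOCKED:
            self.state = DmaState.DEFAULT

    def driver_unbind(self):
        # Unbinding a kernel driver returns the device to block-DMA.
        if self.state is DmaState.DEFAULT:
            self.state = DmaState.BLOCKED

    def attach_ioasid(self):
        if self.state is not DmaState.BLOCKED:
            raise RuntimeError(f"{self.name}: must be in block-DMA to attach")
        self.state = DmaState.IOASID

    def detach_ioasid(self):
        if self.state is DmaState.IOASID:
            self.state = DmaState.BLOCKED


def group_viable(group):
    # Assumed extra check: no group member may sit in the default domain.
    return all(d.state is not DmaState.DEFAULT for d in group)
```

In this model a group stays viable while its members are either blocked or
attached to an IOASID, and becomes non-viable as soon as any member is
driven through the default domain.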

> vfio_iommu_group_notifier will be slightly changed to check whether
> any device within the group is bound to iommu fd. If yes BUG_ON
> is raised to avoid breaking the group isolation. To be consistent
> VFIO_BIND_IOMMU_FD also needs to check group viability.
>
> >
> > > 4) vfio_group_get_device_fd. User can acquire a device fd once the group
> > > is viable;
> >
> > But as you've noted, "viable" doesn't test the IOMMU context of the
> > group devices, it's only a pre-condition for attaching the group to an
> > IOMMU context for isolated access. What changes in the kernel that
> > makes "viable" become "isolated"? A device bound to pci-stub today is
> > certainly not in a DMA blocking context when the host is booted with
> > iommu=pt. Enabling the IOMMU only for device assignment by using
> > iommu=pt is arguably the predominant use case of the IOMMU.
>
> vfio_group_viable() needs to check block-DMA, and iommu=pt
> only affects the DMA API path now. A device which is not bound
> to any driver, or whose driver doesn't do DMA, is left in block-DMA
> state.

I think you're going to get friction from any remaining non-vfio
userspace drivers that rely on a passthrough default domain. Any such
drivers are clearly insecure, and in the case of uio-pci-generic
violate the intended non-DMA use case of the driver, but I think they
still exist. This effectively flips the switch that those drivers can
no longer work in an IOMMU enabled environment.

> > > 5) device-centric when binding to IOMMU fd or attaching to IOASID
> > >
> > > In this model the group viability mechanism is kept but there is no need
> > > for VFIO to track the actual attaching status.
> > >
> > > Now let's look at how the new model works.
> > >
> > > >
> > > > 1) A dual-function PCIe e1000e NIC where the functions are grouped
> > > > together due to ACS isolation issues.
> > > >
> > > > a) Initial state: functions 0 & 1 are both bound to e1000e driver.
> > > >
> > > > b) Admin uses driverctl to bind function 1 to vfio-pci, creating
> > > > vfio device file, which is chmod'd to grant to a user.
> > >
> > > This implies that function 1 is in block-DMA mode when it's unbound
> > > from e1000e.
> >
> > Does this require a kernel change from current? Does it require the
> > host is not in iommu=pt mode? Did vfio or vfio-pci do anything to
> > impose this DMA blocking context? What if function 1 is actually a DMA
>
> I hope above explanation answers them.
>
> > alias of function 0, wouldn't changing function 1's IOMMU context break
> > the operation of function 0?
>
> Sorry I'm not familiar with this DMA aliasing thing. Can you elaborate?

Look for users of pci_add_dma_alias(). These are often multi-function
devices where DMA uses the wrong RID, for instance a driver using
function 0 generates TLPs as function 1, or vice versa. There are
various other similar combinations, but it's also used for a couple
non-transparent bridges. Effectively the IOMMU must map all aliases of
a device. Sometimes those aliases are physical devices within the same
IOMMU group, sometimes there's no struct device at the alias address.
It's a flavor of the addressability issue we have with conventional PCI
devices, but on PCIe.

In the theoretical case above, we actually can't manipulate function 1
IOMMU mappings separate from function 0 if they were aliases. There's
a "userspace can shoot themselves in the foot" aspect to this, but there
should also be a way that userspace can understand these dependencies.
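
As a rough illustration of why aliased functions cannot be managed
separately, here is a toy Python model (not kernel code). The AliasTracker
class and the attach() helper are hypothetical names; only
pci_add_dma_alias() is a real kernel function, and this sketch does not
call or reproduce it.

```python
class AliasTracker:
    """Toy union-find over requester IDs that alias each other."""

    def __init__(self):
        self.parent = {}

    def find(self, rid):
        self.parent.setdefault(rid, rid)
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]  # path halving
            rid = self.parent[rid]
        return rid

    def add_dma_alias(self, rid, alias):
        self.parent[self.find(alias)] = self.find(rid)

    def alias_set(self, rid):
        root = self.find(rid)
        return {r for r in list(self.parent) if self.find(r) == root}


def attach(tracker, contexts, rid, ioasid):
    # Mapping one RID has to cover every alias, whether or not a
    # struct device exists at that address.
    for r in tracker.alias_set(rid):
        contexts[r] = ioasid
```

Attaching either function of an aliased pair moves both entries in
contexts, which is the dependency userspace would need some way to
discover.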

> > > >
> > > > c) User opens vfio function 1 device file and an iommu_fd, binds
> > > > device_fd to iommu_fd.
> > >
> > > User should check group viability before step c).
> >
> > Sure, but "user should" is not a viable security model.
> >
> > > >
> > > > Does this succeed?
> > > > - if no, specifically where does it fail?
> > > > - if yes, vfio can now allow access to the device?
> > > >
> > >
> > > with group viability step c) fails.
> >
> > I'm asking for specifics, is it vfio's responsibility to test viability
> > before trying to bind the device_fd to the iommu_fd and it's vfio that
> > triggers this failure? This sounds like vfio is entirely responsible
> > for managing the integrity of the group.
>
> manage integrity of the group based on block-DMA, but no need of
> a group interface with iommu fd to track group attaching status.
>
> >
> > > > d) Repeat b) for function 0.
> > >
> > > function 0 is in block DMA mode now.
> >
> > Somehow...
> >
> > > >
> > > > e) Repeat c), still using function 1, is it different? Where? Why?
> > >
> > > it's different because group becomes viable now. Then step c) succeeds.
> > > At this point, both function 0/1 are in block-DMA mode thus isolated
> > > from the rest of the system. VFIO allows the user to access function 1
> > > without the need of knowing when function 1 is attached to a new
> > > context (IOASID) via IOMMU fd and whether function 0 is left detached.
> > >
> > > >
> > > > 2) The same NIC as 1)
> > > >
> > > > a) Initial state: functions 0 & 1 bound to vfio-pci, vfio device
> > > > files granted to user, user has bound both device_fds to the same
> > > > iommu_fd.
> > > >
> > > > AIUI, even though not bound to an IOASID, vfio can now enable access
> > > > through the device_fds, right? What specific entity has placed these
> > >
> > > yes
> > >
> > > > devices into a block DMA state, when, and how?
> > >
> > > As explained in 2.b), both devices are put into block-DMA when they
> > > are detached from the default domain which is used when they are
> > > bound to e1000e driver.
> >
> > How do stub drivers interact with this model? How do PCI interconnect
> > drivers work with this model? How do DMA alias devices work with this
> > model? How does iommu=pt work with this model? Does vfio just
> > passively assume the DMA blocking IOMMU context based on other random
> > attributes of the device?
>
> pci-stub and bridge drivers follow the existing viability check.
>
> iommu=pt has no impact on block-DMA.
>
> vfio explicitly tracks the dma-blocking state, which doesn't rely on iommu fd.
>
> but I haven't got time to think about DMA aliasing yet.
>
> >
> > > >
> > > > b) Both devices are attached to the same IOASID.
> > > >
> > > > Are we assuming that each device was atomically moved to the new
> > > > IOMMU context by the IOASID code? What if the IOMMU cannot change
> > > > the domain atomically?
> > >
> > > No. Moving function 0 then function 1, or moving function 0 alone, both
> > > work. The one which hasn't been attached to an IOASID is kept in
> > > block-DMA state.
> >
> > I'm asking whether this can be accomplished atomically relative to
> > device DMA. If the user has access to the device after the bind
> > operation and the device operates in a DMA blocking IOMMU context at
> > that point, it seems that every IOASID context switch must be atomic
> > relative to device DMA or we present an exploitable gap to the user.
>
> the switch is always between block-DMA and a driver-created domain
> which are both secure. Does this assumption meet the 'atomic' behavior
> in your mind?

It's not whether the target domains are secure, it's the fact that
we're now allowing userspace access to a device AND the user can
arbitrarily create new IOASID contexts and switch the device between
them while retaining that access. That's not possible with vfio and
implies that either the DMA context of the device must remain secure
during the switch or we need to determine a means to revoke device
access during the switch. For example, if there was any gap in DMA
isolation of the device during the context switch while the user has
access to the device, then the user could exploit that by repeatedly
switching a device between IOASIDs.
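
The concern can be stated as a property of the path a device takes between
two IOASIDs. This is a minimal Python sketch with made-up context names
("blocked", "ioasid", "identity"); it says nothing about what any real
IOMMU driver does during a domain switch.

```python
ISOLATED = {"blocked", "ioasid"}  # contexts that confine the device


def switch_path(via):
    """Contexts the device passes through while moving between IOASIDs."""
    return ["ioasid", via, "ioasid"]


def has_isolation_gap(path):
    # Any intermediate context outside ISOLATED is a window the user,
    # who keeps device access throughout, could hit by looping the switch.
    return any(ctx not in ISOLATED for ctx in path)
```

A switch that passes through the blocked state has no gap in this model,
while one that momentarily lands in an identity mapping does.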

I want to make sure vfio doesn't need to be involved in IOASID changes.

> > This is another change from vfio, the lifetime of the IOMMU context
> > encompasses the lifetime of device access.
> >
> > > >
> > > > c) The device_fd for function 1 is detached from the IOASID.
> > > >
> > > > Are we assuming the reverse of b) performed by the IOASID code?
> > >
> > > function 1 turns back to block-DMA
> > >
> > > >
> > > > d) The device_fd for function 1 is unbound from the iommu_fd.
> > > >
> > > > Does this succeed?
> > > > - if yes, what is the resulting IOMMU context of the device and
> > > > who owns it?
> > > > - if no, well, that results in numerous tear-down issues.
> > >
> > > Yes. function 1 is block-DMA while function 0 still attached to IOASID.
> > > Actually unbind from IOMMU fd doesn't change the security context.
> > > the change is conducted when attaching/detaching device to/from an
> > > IOASID.
> >
> > But I think you're suggesting that the IOMMU context is simply the
> > device's default domain, so vfio is left in the position where the user
> > gained access to the device by binding it to an iommu_fd, but now the
> > device exists outside of the iommu_fd. Doesn't that make it pointless
> > to gate device access on binding the device to the iommu_fd? The user
> > can get an accessible device_fd unbound from an iommu_fd on the reverse
> > path.
>
> yes, binding to iommu_fd is not the appropriate point of gating
> device access.
>
> >
> > That would mean vfio's only control point for device access is on
> > open().
>
> yes, on open() via block-DMA check in vfio_group_viable().

Let's explore that. DeviceA, DeviceB, and DeviceC are grouped together,
vfio gets an open() call on DeviceA, it passes the group viable and DMA
blocked check, the user now has a device_fd with full device access
(within a DMA blocking IOMMU context). vfio now gets an open() call on
DeviceB... Is it a userspace problem with ACLs on the device files that
the next open could come from a different user?

When the user(s) start binding these device_fds to an iommu_fd, is it
vfio's responsibility to make sure all devices within the group bind to
the same iommu_fd? I think that suggests internal serialization per
group to ioasid binding and a reference per group to the iommu_fd.

The user has now bound DeviceA to an iommu_fd and attached it to an
IOASID, we then get an open() for DeviceC. Does the DMA-blocking
domain check skip DeviceA because it's already attached to an iommu_fd?

I think we also have some pathological cases around DMA aliases and
conventional PCI, for instance DeviceY is a DMA alias of DeviceX, the
user gets a device_fd for DeviceX, binds it to an iommu_fd and attaches
DeviceX to an IOASID... this triggers the IOMMU notifier because the
DMA blocking state of DeviceY has changed and triggers a BUG_ON. Maybe
a niche case, but this particular one would be exploitable by a user
and it's not entirely clear what safeguards and proper sequence by the
user would prevent it.

> > > >
> > > > e) Function 1 is unbound from vfio-pci.
> > > >
> > > > Does this work or is it blocked? If blocked, by what entity
> > > > specifically?
> > >
> > > works.
> > >
> > > >
> > > > f) Function 1 is bound to e1000e driver.
> > > >
> > > > We clearly have a violation here, specifically where and by who in
> > > > this path should have prevented us from getting here or who pushes
> > > > the BUG_ON to abort this?
> > >
> > > via vfio_iommu_group_notifier, same as today.
> >
> > So as above, group integrity remains entirely vfio's issue? Didn't we
>
> sort of...
>
> > discuss elsewhere in this thread that unless group integrity is managed
> > by /dev/iommu that we're going to have a mess of different consumers
> > managing it to different degrees of effectiveness (or more likely just
> > ignoring it)?
>
> Yes, that was the original impression. But after figuring out the new
> block-DMA behavior, I'm not sure whether /dev/iommu must maintain
> its own group integrity check. If it trusts vfio, I feel it's fine to avoid
> such check which even allows a group of devices bound to different
> IOMMU fd's if user likes. Also if we want to sustain the current vfio
> semantics which doesn't require all devices in the group bound to
> vfio driver, seems it's pointless to enforce such integrity check in
> /dev/iommu.

"even allows a group of devices bound to different IOMMU fd's if user
likes", here lies madness. This is exactly why vfio uses the group as
the unit of ownership. To place the entire burden of group isolation
on userspace is essentially the same as removing any concept of group
isolation. This instantly leads to VMs sharing resources they
shouldn't, bizarre address space issues, and exploits through devices
between VMs. I think the kernel would be neglecting its duty to manage
and isolate resources for userspace if such were allowed.

Besides, how would the DMA blocking check pass if other devices in the
group were attached to a random non-blocking domain, ie. another user's
IOASID?

Whether isolation enforcement happens in vfio or /dev/iommu shouldn't
be a question of whether vfio is trusted, it should be a question of
whether vfio's use case is sufficiently unique that other users of the
IOASID infrastructure wouldn't need to reproduce the same security
measures.

> Jason, what's your opinion?
>
> >
> > > >
> > > > 3) A dual-function conventional PCI e1000 NIC where the functions are
> > > > grouped together due to shared RID.
> > > >
> > > > a) Repeat 2.a) and 2.b) such that we have a valid, user accessible
> > > > devices in the same IOMMU context.
> > > >
> > > > b) Function 1 is detached from the IOASID.
> > > >
> > > > I think function 1 cannot be placed into a different IOMMU context
> > > > here, does the detach work? What's the IOMMU context now?
> > >
> > > Yes. Function 1 is back to block-DMA. Since both functions share RID,
> > > essentially it implies function 0 is in block-DMA state too (though its
> > > tracking state may not change yet) since the shared IOMMU context
> > > entry blocks DMA now. In IOMMU fd function 0 is still attached to the
> > > IOASID thus the user still needs to do an explicit detach to clear the
> > > tracking state for function 0.
> > >
> > > >
> > > > c) A new IOASID is alloc'd within the existing iommu_fd and function
> > > > 1 is attached to the new IOASID.
> > > >
> > > > Where, how, by whom does this fail?
> > >
> > > No need to fail. It can succeed since doing so just hurts user's own foot.
> > >
> > > The only question is how user knows the fact that a group of devices
> > > share RID thus avoid such thing. I'm curious how it is communicated
> > > with today's VFIO mechanism. Yes the group-centric VFIO uAPI prevents
> > > a group of devices from attaching to multiple IOMMU contexts, but
> > > suppose we still need a way to tell the user to not do so. Especially
> > > such knowledge would be also reflected in the virtual PCI topology
> > > when the entire group is assigned to the guest which needs to know
> > > this fact when vIOMMU is exposed. I haven't found time to investigate
> > > it but suppose if such channel exists it could be reused, or in the worst
> > > case we may have the new device capability interface to convey...
> >
> > No such channel currently exists, it's not an issue today, IOMMU
> > context is group-based.
>
> Interesting... If such group of devices are assigned to a guest, how does
> Qemu decide the virtual PCI topology for them? Do they have same
> vRID or different?

That's the beauty of it, it doesn't matter how many RIDs exist in the
group, or which devices have aliases, the group is the minimum
granularity of a container where QEMU knows that a container provides
a single address space. Therefore a container must exist in a single
address space in the PCI topology. In a conventional or non-vIOMMU
topology, the PCI address space is equivalent to the system memory
address space. When vIOMMU gets involved, multiple devices within the
same group must exist in the same address space. A vPCIe-to-PCI bridge
can be used to create that shared address space.
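
A sketch of that placement rule, as a toy Python function rather than
anything QEMU actually contains: plan_address_spaces() and the
"shared-bridge" / "own-root-port" labels are hypothetical, and putting
single-device groups behind their own root port is an assumption made only
to contrast the two cases.

```python
from collections import defaultdict


def plan_address_spaces(device_to_group):
    """Map each IOMMU group to one guest address space (toy model)."""
    spaces = defaultdict(list)
    for dev, group in sorted(device_to_group.items()):
        spaces[group].append(dev)
    plan = {}
    for group, devs in spaces.items():
        # Multi-device groups share one space, e.g. behind one bridge.
        kind = "shared-bridge" if len(devs) > 1 else "own-root-port"
        plan[group] = (kind, devs)
    return plan
```

The number of RIDs or aliases inside a group never enters the decision;
only group membership does.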

I've referred to this as a limitation of type1, that we can't put
devices within the same group into different address spaces, such as
behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
As isolation support improves we see fewer multi-device groups, this
scenario becomes the exception. Buy better hardware to use the devices
independently.

> > > > If vfio gets to offload all of its group management to IOASID code,
> > > > that's great, but I'm afraid that IOASID is so focused on a
> > > > device-level API that we're instead just ignoring the group dynamics
> > > > and vfio will be forced to provide oversight to maintain secure
> > > > userspace access. Thanks,
> > > >
> > >
> > > In summary, the security of the group dynamics are handled through
> > > block-DMA plus existing vfio_group_viable mechanism in this device-
> > > centric design. VFIO still keeps its group management, but no need
> > > to track the attaching status for allowing user access.
> >
> > Still seems pretty loosely defined to me, the DMA blocking mechanism
>
> Sorry for that and hope the explanation in this mail makes it clearer.
>
> > isn't specified, there's no verification of the IOMMU context for
> > "stray" group devices, the group management is based in the IOASID
> > consumer code leading to varying degrees of implementation and
> > effectiveness between callers, we lean more heavily on a fragile
> > notifier to notice and hit the panic button on violation.
> >
> > It would make a lot more sense to me if the model were for vfio to
> > bind groups to /dev/iommu, the IOASID code manages group integrity, and
> > devices can still be moved between IOASIDs as is the overall goal. The
> > group is the basis of ownership, which makes it a worthwhile part of
> > the API. Thanks,
> >
>
> Having explained the device-centric design, honestly speaking I think
> your model could also work. Group is an iommu concept, thus not
> unsound by asking /dev/iommu to manage the group integrity, e.g.
> moving the block-DMA verification and vfio_group_viable() into
> /dev/iommu and verify it when doing group binding. A successful
> group binding implies all group verification passed then user access
> can be allowed. But even doing so I don't expect /dev/iommu uAPI
> will include any explicit group semantics. Just the in-kernel helper
> functions accept a group via VFIO_GROUP_BIND_IOMMU_FD.

As above, how unique is the vfio use case? IMO, if there are cases
where we're providing userspace DMA capable access to a device and
we're not taking into account the full IOMMU group isolation model of
that device, it's broken. Is it really feasible to expect every
consumer of the interface to do this sort of homework?

There are aspects here that I like, it would be convenient to hide
groups, but I'm also rediscovering how many problems we solved with our
usage of groups in vfio. Thanks,

Alex

2021-06-18 03:07:45

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Tue, Jun 15, 2021 at 10:12:15AM -0600, Alex Williamson wrote:
>
> 1) A dual-function PCIe e1000e NIC where the functions are grouped
> together due to ACS isolation issues.
>
> a) Initial state: functions 0 & 1 are both bound to e1000e driver.
>
> b) Admin uses driverctl to bind function 1 to vfio-pci, creating
> vfio device file, which is chmod'd to grant to a user.
>
> c) User opens vfio function 1 device file and an iommu_fd, binds
> device_fd to iommu_fd.
>
> Does this succeed?
> - if no, specifically where does it fail?

No, the e1000e driver is still connected to the device.

It fails during the VFIO_BIND_IOASID_FD call because the iommu common
code checks the group membership for consistency.

We detect it basically the same way things work today, just moved to
the iommu code.

> d) Repeat b) for function 0.
> e) Repeat c), still using function 1, is it different? Where? Why?

Succeeds because all group device members are now bound to vfio

It is hard to predict the nicest way to do all of this, but I would
start by imagining that iommu_fd using drivers (like vfio) will call
some kind of iommu_fd_allow_dma_blocking() call during their probe()
which organizes the machinery to drive this.

> 2) The same NIC as 1)
>
> a) Initial state: functions 0 & 1 bound to vfio-pci, vfio device
> files granted to user, user has bound both device_fds to the same
> iommu_fd.
>
> AIUI, even though not bound to an IOASID, vfio can now enable access
> through the device_fds, right?

Yes

> What specific entity has placed these
> devices into a block DMA state, when, and how?

To keep all the semantics the same it must be done as part of
VFIO_BIND_IOASID_FD.

This will have to go over every device in the group and put it in the
dma blocked state. Riffing on the above this is possible if there is
no attached device driver, or the device driver that is attached has
called iommu_fd_allow_dma_blocking() during its probe()
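
The rule in the last two answers can be sketched as follows. This is a toy
Python model; iommu_fd_allow_dma_blocking() is only an imagined call in
this thread, so the allows_blocking flag, the Dev class and
bind_ioasid_fd() are all hypothetical stand-ins.

```python
class Dev:
    def __init__(self, name, driver=None, allows_blocking=False):
        self.name = name
        self.driver = driver                    # None means no driver bound
        self.allows_blocking = allows_blocking  # set by the driver in probe()
        self.dma_blocked = False


def bind_ioasid_fd(group):
    """Toy VFIO_BIND_IOASID_FD: block DMA for the whole group or fail."""
    for d in group:
        if d.driver is not None and not d.allows_blocking:
            raise PermissionError(f"{d.name} is owned by {d.driver}")
    for d in group:
        d.dma_blocked = True
```

With function 0 still on e1000e the bind fails before any device is
touched (case 1c); once both functions are on an opted-in driver the whole
group is blocked in one step (case 1e).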

I haven't gone through all of Kevin's notes about how this could be
sorted out directly in the iommu code though..

> b) Both devices are attached to the same IOASID.
>
> Are we assuming that each device was atomically moved to the new
> IOMMU context by the IOASID code? What if the IOMMU cannot change
> the domain atomically?

What does "atomically" mean here? I assume all IOMMU HW can
change IOASIDs without accidentally leaking traffic
through.

Otherwise that is a major design restriction..

> c) The device_fd for function 1 is detached from the IOASID.
>
> Are we assuming the reverse of b) performed by the IOASID code?

Yes, the IOMMU will change from the active IOASID to the "block DMA"
ioasid in a way that is secure.

> d) The device_fd for function 1 is unbound from the iommu_fd.
>
> Does this succeed?

Yes

> - if yes, what is the resulting IOMMU context of the device and
> who owns it?

device_fd for function 1 remains set to the "block DMA"
ioasid.

Attempting to attach a kernel driver triggers bug_on as today

Attempting to open it again and use it with a different iommu_fd fails

> e) Function 1 is unbound from vfio-pci.
>
> Does this work or is it blocked? If blocked, by what entity
> specifically?

As today, it is allowed. The IOASID would have to remain at the "block
all dma" until the implicit connection to the group in the iommu_fd is
released.

> f) Function 1 is bound to e1000e driver.

As today bug_on is triggered via the same maze of notifiers (gross,
but where we are for now). The notifiers would be done by the iommu_fd
instead of vfio

> 3) A dual-function conventional PCI e1000 NIC where the functions are
> grouped together due to shared RID.

This operates effectively the same as today. Manipulating a device
implicitly manipulates the group. Instead of doing dma block the
devices track the IOASID the group is using.

We model it by demanding that all devices attach to the same IOASID
and instead of doing the DMA block step the device remains attached to
the group's IOASID. Today this is such an uncommon configuration (a
PCI bridge!) we shouldn't design the entire API around it.
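
A minimal Python sketch of that rule, assuming a per-group flag that says
whether members share a RID. The Group class and its fields are
hypothetical and exist only to make the constraint concrete.

```python
class Group:
    def __init__(self, shared_rid):
        self.shared_rid = shared_rid
        self.attached = {}  # device name -> ioasid

    def attach(self, dev, ioasid):
        others = {v for k, v in self.attached.items() if k != dev}
        if self.shared_rid and others and others != {ioasid}:
            raise ValueError("group members must share one IOASID")
        self.attached[dev] = ioasid

    def detach(self, dev):
        # With a shared RID, detach only drops the tracking entry; the
        # hardware context stays on the group's IOASID while others remain.
        self.attached.pop(dev, None)
```

For case 3c this means the attach of function 1 to a second IOASID is
refused while function 0 is still attached to the first.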

> If vfio gets to offload all of its group management to IOASID code,
> that's great, but I'm afraid that IOASID is so focused on a
> device-level API that we're instead just ignoring the group dynamics
> and vfio will be forced to provide oversight to maintain secure
> userspace access.

I think it would be a major design failure if VFIO is required to
provide additional security on top of the iommu code. This is
basically the refactoring exercise - to move the VFIO code that is
only about iommu concerns to the iommu layer and VFIO becomes thinner.

Otherwise we still can't properly share this code - why should VDPA
and VFIO have different isolation models? Is it just because we expect
that everything except VFIO has 1:1 groups or no group at all? Feels
wonky.

Jason

2021-06-18 03:08:10

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:

> I've referred to this as a limitation of type1, that we can't put
> devices within the same group into different address spaces, such as
> behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> As isolation support improves we see fewer multi-device groups, this
> scenario becomes the exception. Buy better hardware to use the devices
> independently.

This is basically my thinking too, but my conclusion is that we should
not continue to make groups central to the API.

As I've explained to David this is actually causing functional
problems and mess - and I don't see a clean way to keep groups central
but still have the device in control of what is happening. We need
this device <-> iommu connection to be direct to robustly model all
the things that are in the RFC.

To keep groups central someone needs to sketch out how to solve
today's mdev SW page table and mdev PASID issues in a clean
way. Device centric is my suggestion on how to make it clean, but I
haven't heard an alternative??

So, I view the purpose of this discussion to scope out what a
device-centric world looks like and then if we can securely fit in the
legacy non-isolated world on top of that clean future oriented
API. Then decide if it is work worth doing or not.

To my mind it looks like it is not so bad, granted not every detail is
clear, and no code has been sketched, but I don't see a big scary
blocker emerging. An extra ioctl or two, some special logic that
activates for >1 device groups that looks a lot like VFIO's current
logic..

At some level I would be perfectly fine if we made the group FD part
of the API for >1 device groups - except that complexifies every user
space implementation to deal with that. It doesn't feel like a good
trade off.

Jason

(I've been off this week so I didn't try to read/answer absolutely
everything, just a few things - though it looks like this is settling
down into 'kevin make a specific proposal' kind of situation..)

2021-06-18 03:14:34

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 17, 2021 at 07:31:03AM +0000, Tian, Kevin wrote:
> > > Yes. function 1 is block-DMA while function 0 still attached to IOASID.
> > > Actually unbind from IOMMU fd doesn't change the security context.
> > > the change is conducted when attaching/detaching device to/from an
> > > IOASID.
> >
> > But I think you're suggesting that the IOMMU context is simply the
> > device's default domain, so vfio is left in the position where the user
> > gained access to the device by binding it to an iommu_fd, but now the
> > device exists outside of the iommu_fd.

I don't think unbind should be allowed. Close the fd and re-open it if
you want to attach to a different iommu_fd.

> > to gate device access on binding the device to the iommu_fd? The user
> > can get an accessible device_fd unbound from an iommu_fd on the reverse
> > path.
>
> yes, binding to iommu_fd is not the appropriate point of gating
> device access.

Binding is the only point we have enough information to make a
full security decision. Device FDs that are not bound must be
inoperable until bound.

The complexities with revoking mmap/etc are what led me to conclude
that unbind is not worth doing - we can't go back to an inoperable
state very easily.
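
The lifecycle being argued for looks roughly like this toy Python model.
DeviceFd and mmio_read() are hypothetical names, and the read returns a
placeholder value since there is no device behind it.

```python
class DeviceFd:
    def __init__(self):
        self.iommu_fd = None
        self.closed = False

    def bind(self, iommu_fd):
        if self.closed:
            raise RuntimeError("fd is closed")
        if self.iommu_fd is not None:
            raise RuntimeError("already bound; close and re-open instead")
        self.iommu_fd = iommu_fd

    def mmio_read(self, offset):
        # Access is gated on the bind, not on open().
        if self.closed or self.iommu_fd is None:
            raise PermissionError("device fd is inoperable until bound")
        return 0  # placeholder value

    def close(self):
        # The only way back to the unbound state is to drop the fd.
        self.closed = True
        self.iommu_fd = None
```

There is deliberately no unbind() method: moving a device to a different
iommu_fd means closing this fd and opening a new one.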

> Yes, that was the original impression. But after figuring out the new
> block-DMA behavior, I'm not sure whether /dev/iommu must maintain
> its own group integrity check. If it trusts vfio, I feel it's fine to avoid
> such check which even allows a group of devices bound to different
> IOMMU fd's if user likes. Also if we want to sustain the current vfio
> semantics which doesn't require all devices in the group bound to
> vfio driver, seems it's pointless to enforce such integrity check in
> /dev/iommu.
>
> Jason, what's your opinion?

I think the iommu code should do all of this, I don't see why vfio
should be dealing with *iommu* isolation.

The rest of this email got a bit long for me to catch up on, sorry :\

Jason

2021-06-18 04:52:12

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 17, 2021 at 03:02:33PM +1000, David Gibson wrote:

> In other words, do we really have use cases where we need to identify
> different device IDs, even though we know they're not isolated?

I think when PASID is added in and all the complexity that brings, it
does become more important, yes.

At the minimum we should scope the complexity.

I'm not convinced it is so complicated, really it is just a single bit
of information toward userspace: 'all devices in this group must use
the same IOASID'

Something like qemu consumes this bit and creates the pci/pcie bridge
to model this to the guest and so on.

Something like dpdk just doesn't care (same as today).

Jason

2021-06-18 04:53:01

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 17, 2021 at 02:45:46PM +1000, David Gibson wrote:
> On Wed, Jun 09, 2021 at 09:39:19AM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 09, 2021 at 02:24:03PM +0200, Joerg Roedel wrote:
> > > On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
> > > > - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> > > > convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> > > > being device-centric (but it's fine for vfio to be group-centric). A new
> > > > section will be added to elaborate this part;
> > >
> > > I would vote for group-centric here. Or do the reasons for which VFIO is
> > > group-centric not apply to IOASID? If so, why?
> >
> > VFIO being group centric has made it very ugly/difficult to inject
> > device driver specific knowledge into the scheme.
> >
> > The device driver is the only thing that knows to ask:
> > - I need a SW table for this ioasid because I am like an mdev
> > - I will issue TLPs with PASID
> > - I need an IOASID linked to a PASID
> > - I am a device that uses ENQCMD and vPASID
> > - etc in future
>
> mdev drivers might know these, but shim drivers, like basic vfio-pci
> often won't.

The generic drivers say 'I will do every kind of DMA possible', which
is in and of itself a special kind of information to convey.

There are a lot of weird corners to think about here, like what if the
guest asks for a PASID on an mdev that doesn't support PASID, but is
hooked to a RID that does, or other quite nonsensical combinations. These
need to be blocked/handled/whatever properly, which is made much
easier if the common code actually knows the details of what is going
on.
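
As a sketch of what "the common code knows" could buy, assume each driver
declares a set of capability strings for its device and the common code
checks requests against that set. validate_request() and the capability
names are hypothetical.

```python
def validate_request(declared, requested):
    """Reject anything the driver did not declare for this device."""
    missing = sorted(set(requested) - set(declared))
    if missing:
        raise ValueError("not supported by device: " + ", ".join(missing))
    return True
```

The check runs against what the mdev declared, not against what its parent
RID can do, so a PASID request on a non-PASID mdev is refused even when the
parent supports PASID.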

> I still think you're having a tendency to partially conflate several
> meanings of "group":
> 1. the unavoidable hardware unit of non-isolation
> 2. the kernel internal concept and interface to it
> 3. the user visible fd and interface

I think I have those pretty clearly separated :)

> We can't avoid having (1) somewhere, (3) and to a lesser extent (2)
> are what you object to.

I don't like (3) either, and am yet to hear a definitive reason why we
must have it..

> > The current approach has the group try to guess the device driver
> > intention in the vfio type 1 code.
>
> I agree this has gotten ugly. What I'm not yet convinced of is that
> reworking groups to make this not-ugly necessarily requires totally
> minimizing the importance of groups.

I think it does - we can't have the group in the middle and still put
the driver in charge, it doesn't really work.

At least if someone can see an arrangement otherwise let's hear it -
start with how to keep groups and remove the mdev hackery from type1..

Jason

2021-06-18 08:03:38

by Lu Baolu

Subject: Re: Plan for /dev/ioasid RFC v2

Hi David,

On 6/17/21 1:22 PM, David Gibson wrote:
>> The iommu_group can guarantee the isolation among different physical
>> devices (represented by RIDs). But when it comes to sub-devices (ex. mdev or
>> vDPA devices represented by RID + SSID), we have to rely on the
>> device driver for isolation. The devices which are able to generate sub-
>> devices should either use their own on-device mechanisms or use the
>> platform features like Intel Scalable IOV to isolate the sub-devices.
> This seems like a misunderstanding of groups. Groups are not tied to
> any PCI meaning. Groups are the smallest unit of isolation, no matter
> what is providing that isolation.
>
> If mdevs are isolated from each other by clever software, even though
> they're on the same PCI device they are in different groups from each
> other *by definition*. They are also in a different group from their
> parent device (however the mdevs only exist when mdev driver is
> active, which implies that the parent device's group is owned by the
> kernel).


You are right. This is also my understanding of an "isolation group".

But, as I understand it, iommu_group is only the isolation group visible
to the IOMMU. When we talk about sub-devices (sw-mdev or mdev w/ pasid),
only the device and its driver know the details of isolation, hence
iommu_group could not be extended to cover them. The device drivers
should define their own isolation groups.

Otherwise, the device driver has to fake an iommu_group and add hacky
code to link the related IOMMU elements (iommu device, domain, group
etc.) together. Actually this is part of the problem that this proposal
tries to solve.

>
>> Under the above conditions, different sub-devices of the same RID device
>> could use different IOASIDs. This seems to mean that we can't
>> support mixed mode where, for example, two RIDs share an iommu_group and
>> one (or both) of them have sub-devices.
> That doesn't necessarily follow. mdevs which can be successfully
> isolated by their mdev driver are in a different group from their
> parent device, and therefore need not be affected by whether the
> parent device shares a group with some other physical device. They
> *might* be, but that's up to the mdev driver to determine based on
> what it can safely isolate.
>

If we understand it as multiple levels of isolation, can we classify the
devices into the following categories?

1) Legacy devices
- devices without device-level isolation
- multiple devices could sit in a single iommu_group
- only a single I/O address space could be bound to IOMMU

2) Modern devices
- devices capable of device-level isolation
- able to have subdevices
- self-isolated, hence do not share an iommu_group with others
- multiple I/O address spaces could be bound to IOMMU

For 1), all devices in an iommu_group should be bound to a single
IOASID; The isolation is guaranteed by an iommu_group.

For 2) a single device could be bound to multiple IOASIDs with each sub-
device corresponding to an IOASID. The isolation of each subdevice is
guaranteed by the device driver.
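
To make the two rules concrete, here is a minimal sketch of the attach
policy they would imply. The names (dev_info, may_attach, NO_IOASID) are
hypothetical, and treating "modern" as "singleton group plus
driver-provided sub-device isolation" is an assumption made only for
this example.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_IOASID (-1) /* hypothetical marker: group not attached yet */

struct dev_info {
    int  group_size;        /* number of devices in the iommu_group */
    bool subdev_isolation;  /* driver can isolate its sub-devices */
};

/* Category 2): alone in its group and able to isolate sub-devices. */
static bool is_modern(const struct dev_info *d)
{
    return d->group_size == 1 && d->subdev_isolation;
}

/*
 * Category 1): the whole group shares one IOASID, so a new attach is
 * only allowed if it matches what the group already uses.
 * Category 2): each sub-device may get its own IOASID.
 */
static bool may_attach(const struct dev_info *d, int group_ioasid,
                       int new_ioasid)
{
    assert(d->group_size >= 1);
    if (is_modern(d))
        return true;
    return group_ioasid == NO_IOASID || group_ioasid == new_ioasid;
}
```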

Best regards,
baolu

2021-06-18 13:50:37

by Joerg Roedel

Subject: Re: Plan for /dev/ioasid RFC v2

Hi Kevin,

On Thu, Jun 17, 2021 at 07:31:03AM +0000, Tian, Kevin wrote:
> Now let's talk about the new IOMMU behavior:
>
> - A device is blocked from doing DMA to any resource outside of
> its group when it's probed by the IOMMU driver. This could be a
> special state w/o attaching to any domain, or a new special domain
> type which differentiates it from existing domain types (identity,
> dma, or unmanaged). Actually, existing code already includes an
> IOMMU_DOMAIN_BLOCKED type but nobody uses it.

There is a reason for the default domain to exist: Devices which require
RMRR mappings to be present. You can't just block all DMA from devices
until a driver takes over, we put much effort into making sure there is
not even a small window in time where RMRR regions (unity mapped regions
on AMD) are not mapped.

And if a device has no RMRR regions defined, then the default domain
will be identical to a blocking domain. Device driver bugs don't count
here, as they can be fixed. The kernel trusts itself, so we can rely on
drivers unmapping all of their DMA buffers. Maybe that should be checked
by dma-debug to find violations there.
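
A minimal sketch of that equivalence, under the assumption that no
driver has any DMA buffer mapped: the default domain then permits
exactly the reserved (RMRR / unity mapped) ranges and nothing else. The
names resv_region and default_domain_allows are hypothetical and only
used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one RMRR / unity mapped range. */
struct resv_region {
    uint64_t start;
    uint64_t len;
};

/*
 * With no driver mappings present, a DMA is allowed only if it hits a
 * reserved region. With zero regions this always returns false, i.e.
 * the default domain behaves like a blocking domain.
 */
static bool default_domain_allows(const struct resv_region *r, size_t n,
                                  uint64_t iova)
{
    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (iova >= r[i].start && iova - r[i].start < r[i].len)
            return true;
    }
    return false;
}
```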

Regards,

Joerg

2021-06-18 18:31:30

by Ashok Raj

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 18, 2021 at 12:15:06PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 18, 2021 at 03:47:51PM +0200, Joerg Roedel wrote:
> > Hi Kevin,
> >
> > On Thu, Jun 17, 2021 at 07:31:03AM +0000, Tian, Kevin wrote:
> > > Now let's talk about the new IOMMU behavior:
> > >
> > > - A device is blocked from doing DMA to any resource outside of
> > > its group when it's probed by the IOMMU driver. This could be a
> > > special state w/o attaching to any domain, or a new special domain
> > > type which differentiates it from existing domain types (identity,
> > > > dma, or unmanaged). Actually, existing code already includes an
> > > > IOMMU_DOMAIN_BLOCKED type but nobody uses it.
> >
> > There is a reason for the default domain to exist: Devices which require
> > RMRR mappings to be present. You can't just block all DMA from devices
> > until a driver takes over, we put much effort into making sure there is
> > not even a small window in time where RMRR regions (unity mapped regions
> > on AMD) are not mapped.
>
> Yes, I think the DMA blocking can only start around/after a VFIO type
> driver has probed() and bound to a device in the group, not much
> different from today.

Does this mean that when a device has a required "RMRR" that needs a unity
mapping, we block assigning that device to guests? I remember we had some
restriction, but there was a need to work around it at some point in time.

- Either we disallow assigning devices with RMRR, or
- we break that unity map when the device is probed, after which any RMRR
access from the device will fault.

Cheers,
Ashok

2021-06-18 19:21:41

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, June 18, 2021 8:20 AM
>
> On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
>
> > I've referred to this as a limitation of type1, that we can't put
> > devices within the same group into different address spaces, such as
> > behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> > As isolation support improves we see fewer multi-device groups, this
> > scenario becomes the exception. Buy better hardware to use the devices
> > independently.
>
> This is basically my thinking too, but my conclusion is that we should
> not continue to make groups central to the API.
>
> As I've explained to David this is actually causing functional
> problems and mess - and I don't see a clean way to keep groups central
> but still have the device in control of what is happening. We need
> this device <-> iommu connection to be direct to robustly model all
> the things that are in the RFC.
>
> To keep groups central someone needs to sketch out how to solve
> today's mdev SW page table and mdev PASID issues in a clean
> way. Device centric is my suggestion on how to make it clean, but I
> haven't heard an alternative??
>
> So, I view the purpose of this discussion to scope out what a
> device-centric world looks like and then if we can securely fit in the
> legacy non-isolated world on top of that clean future oriented
> API. Then decide if it is work worth doing or not.
>
> To my mind it looks like it is not so bad, granted not every detail is
> clear, and no code has been sketched, but I don't see a big scary
> blocker emerging. An extra ioctl or two, some special logic that
> activates for >1 device groups that looks a lot like VFIO's current
> logic..
>
> At some level I would be perfectly fine if we made the group FD part
> of the API for >1 device groups - except that complexifies every user
> space implementation to deal with that. It doesn't feel like a good
> trade off.
>

Would it be an acceptable tradeoff to leave >1 device groups
supported only via legacy VFIO (which is kept anyway for backward
compatibility), if we think such a scenario is being deprecated over
time (thus there is little value in adding new features to it)? Then all
new sub-systems including vdpa and new vfio only support singleton
device groups via /dev/iommu...
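
From the userspace side the fallback could look roughly like the sketch
below. This is only an illustration of the proposed split; the names
(backend, pick_backend) are hypothetical, and how userspace learns the
group size is left out.

```c
#include <assert.h>
#include <stdbool.h>

enum backend {
    BACKEND_IOMMU_FD,    /* new /dev/iommu path, singleton groups only */
    BACKEND_LEGACY_VFIO, /* existing group-centric VFIO uAPI */
};

/*
 * Use /dev/iommu only when it exists and the device is alone in its
 * group; everything else stays on the legacy VFIO interface, which is
 * kept for backward compatibility anyway.
 */
static enum backend pick_backend(int group_size, bool iommu_fd_available)
{
    assert(group_size >= 1);
    if (iommu_fd_available && group_size == 1)
        return BACKEND_IOMMU_FD;
    return BACKEND_LEGACY_VFIO;
}
```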

Thanks
Kevin

2021-06-18 19:38:27

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 18, 2021 at 04:57:40PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, June 18, 2021 8:20 AM
> >
> > On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
> >
> > > I've referred to this as a limitation of type1, that we can't put
> > > devices within the same group into different address spaces, such as
> > > behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> > > As isolation support improves we see fewer multi-device groups, this
> > > scenario becomes the exception. Buy better hardware to use the devices
> > > independently.
> >
> > This is basically my thinking too, but my conclusion is that we should
> > not continue to make groups central to the API.
> >
> > As I've explained to David this is actually causing functional
> > problems and mess - and I don't see a clean way to keep groups central
> > but still have the device in control of what is happening. We need
> > this device <-> iommu connection to be direct to robustly model all
> > the things that are in the RFC.
> >
> > To keep groups central someone needs to sketch out how to solve
> > today's mdev SW page table and mdev PASID issues in a clean
> > way. Device centric is my suggestion on how to make it clean, but I
> > haven't heard an alternative??
> >
> > So, I view the purpose of this discussion to scope out what a
> > device-centric world looks like and then if we can securely fit in the
> > legacy non-isolated world on top of that clean future oriented
> > API. Then decide if it is work worth doing or not.
> >
> > To my mind it looks like it is not so bad, granted not every detail is
> > clear, and no code has been sketched, but I don't see a big scary
> > blocker emerging. An extra ioctl or two, some special logic that
> > activates for >1 device groups that looks a lot like VFIO's current
> > logic..
> >
> > At some level I would be perfectly fine if we made the group FD part
> > of the API for >1 device groups - except that complexifies every user
> > space implementation to deal with that. It doesn't feel like a good
> > trade off.
> >
>
> Would it be an acceptable tradeoff to leave >1 device groups
> supported only via legacy VFIO (which is kept anyway for backward
> compatibility), if we think such a scenario is being deprecated over
> time (thus there is little value in adding new features to it)? Then all
> new sub-systems including vdpa and new vfio only support singleton
> device groups via /dev/iommu...

That might just be a great idea - userspace has to support those APIs
anyhow. If it can be made trivially obvious to use this fallback even
though /dev/iommu is available, it is a great place to start. It also
means PASID/etc are naturally blocked off.

Maybe years down the road we will want to harmonize them, so I would
still sketch it out enough to be confident it could be implemented..

Jason

2021-06-18 21:26:48

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 18, 2021 at 03:47:51PM +0200, Joerg Roedel wrote:
> Hi Kevin,
>
> On Thu, Jun 17, 2021 at 07:31:03AM +0000, Tian, Kevin wrote:
> > Now let's talk about the new IOMMU behavior:
> >
> > - A device is blocked from doing DMA to any resource outside of
> > its group when it's probed by the IOMMU driver. This could be a
> > special state w/o attaching to any domain, or a new special domain
> > type which differentiates it from existing domain types (identity,
> > > dma, or unmanaged). Actually, existing code already includes an
> > > IOMMU_DOMAIN_BLOCKED type but nobody uses it.
>
> There is a reason for the default domain to exist: Devices which require
> RMRR mappings to be present. You can't just block all DMA from devices
> until a driver takes over, we put much effort into making sure there is
> not even a small window in time where RMRR regions (unity mapped regions
> on AMD) are not mapped.

Yes, I think the DMA blocking can only start around/after a VFIO type
driver has probed() and bound to a device in the group, not much
different from today.

Jason

2021-06-18 22:28:33

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, 18 Jun 2021 08:37:35 -0700
"Raj, Ashok" <[email protected]> wrote:

> On Fri, Jun 18, 2021 at 12:15:06PM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 18, 2021 at 03:47:51PM +0200, Joerg Roedel wrote:
> > > Hi Kevin,
> > >
> > > On Thu, Jun 17, 2021 at 07:31:03AM +0000, Tian, Kevin wrote:
> > > > Now let's talk about the new IOMMU behavior:
> > > >
> > > > - A device is blocked from doing DMA to any resource outside of
> > > > its group when it's probed by the IOMMU driver. This could be a
> > > > special state w/o attaching to any domain, or a new special domain
> > > > type which differentiates it from existing domain types (identity,
> > > > > dma, or unmanaged). Actually, existing code already includes an
> > > > > IOMMU_DOMAIN_BLOCKED type but nobody uses it.
> > >
> > > There is a reason for the default domain to exist: Devices which require
> > > RMRR mappings to be present. You can't just block all DMA from devices
> > > until a driver takes over, we put much effort into making sure there is
> > > not even a small window in time where RMRR regions (unity mapped regions
> > > on AMD) are not mapped.
> >
> > Yes, I think the DMA blocking can only start around/after a VFIO type
> > driver has probed() and bound to a device in the group, not much
> > different from today.
>
> Does this mean that when a device has a required "RMRR" that needs a unity
> mapping, we block assigning that device to guests? I remember we had some
> restriction, but there was a need to work around it at some point in time.
>
> - Either we disallow assigning devices with RMRR, or
> - we break that unity map when the device is probed, after which any RMRR
> access from the device will fault.

We currently disallow assignment of RMRR encumbered devices except for
the known cases of USB and IGD. In the general case, an RMRR imposes
a requirement on the host system to maintain ranges of identity mapping
that is incompatible with userspace ownership of the device and IOVA
address space. AFAICT, nothing changes in the /dev/iommu model that
would make it safe to entrust userspace with RMRR encumbered devices.
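
As a minimal sketch of that policy: the function below rejects RMRR
encumbered devices unless they fall into the USB or graphics exceptions.
Recognising those exceptions by PCI class code is an assumption made for
this example, and may_assign_to_user is a hypothetical name, not the
kernel's actual helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * pci_class is the 24-bit class code: base class, sub-class, prog-if.
 * 0x0c03 is the serial-bus/USB class, 0x03 the display base class.
 */
static bool may_assign_to_user(uint32_t pci_class, bool has_rmrr)
{
    assert((pci_class >> 24) == 0);
    if (!has_rmrr)
        return true;

    bool is_usb = (pci_class >> 8) == 0x0c03;
    bool is_gfx = (pci_class >> 16) == 0x03;

    /* Known relaxable cases only; every other RMRR device is refused. */
    return is_usb || is_gfx;
}
```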
Thanks,

Alex

2021-06-24 04:54:32

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 17, 2021 at 08:10:04PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 17, 2021 at 02:45:46PM +1000, David Gibson wrote:
> > On Wed, Jun 09, 2021 at 09:39:19AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 09, 2021 at 02:24:03PM +0200, Joerg Roedel wrote:
> > > > On Mon, Jun 07, 2021 at 02:58:18AM +0000, Tian, Kevin wrote:
> > > > > - Device-centric (Jason) vs. group-centric (David) uAPI. David is not fully
> > > > > convinced yet. Based on discussion v2 will continue to have ioasid uAPI
> > > > > being device-centric (but it's fine for vfio to be group-centric). A new
> > > > > section will be added to elaborate this part;
> > > >
> > > > I would vote for group-centric here. Or do the reasons for which VFIO is
> > > > group-centric not apply to IOASID? If so, why?
> > >
> > > VFIO being group centric has made it very ugly/difficult to inject
> > > device driver specific knowledge into the scheme.
> > >
> > > The device driver is the only thing that knows to ask:
> > > - I need a SW table for this ioasid because I am like a mdev
> > > - I will issue TLPs with PASID
> > > > - I need an IOASID linked to a PASID
> > > > - I am a device that uses ENQCMD and vPASID
> > > - etc in future
> >
> > mdev drivers might know these, but shim drivers, like basic vfio-pci
> > often won't.
>
> The generic drivers say 'I will do every kind of DMA possible', which
> is in and of itself a special kind of information to convey.
>
> There are a lot of weird corners to think about here, like what if the
> guest asks for a PASID on a mdev that doesn't support PASID, but is
> hooked to a RID that does, or other quite nonsensical combinations. These
> need to be blocked/handled/whatever properly, which is made much
> easier if the common code actually knows the details of what is going
> on.
>
> > I still think you're having a tendency to partially conflate several
> > meanings of "group":
> > 1. the unavoidable hardware unit of non-isolation
> > 2. the kernel internal concept and interface to it
> > 3. the user visible fd and interface
>
> I think I have those pretty clearly separated :)
>
> > We can't avoid having (1) somewhere, (3) and to a lesser extent (2)
> > are what you object to.
>
> I don't like (3) either, and am yet to hear a definitive reason why we
> must have it..

I don't know that there's a "definitive" reason. My concern (and I
think Alex's as well) is that if there's no (3), it tends to lead to a
lack of (2), and lack of (2) tends to make people sloppily forget
about (1) and lead to breakage.

> > > The current approach has the group try to guess the device driver
> > > intention in the vfio type 1 code.
> >
> > I agree this has gotten ugly. What I'm not yet convinced of is that
> > reworking groups to make this not-ugly necessarily requires totally
> > minimizing the importance of groups.
>
> I think it does - we can't have the group in the middle and still put
> the driver in charge, it doesn't really work.
>
> At least if someone can see an arrangement otherwise let's hear it -
> start with how to keep groups and remove the mdev hackery from type1..
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2021-06-24 04:55:47

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 17, 2021 at 08:04:38PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 17, 2021 at 03:02:33PM +1000, David Gibson wrote:
>
> > In other words, do we really have use cases where we need to identify
> > different device IDs, even though we know they're not isolated?
>
> I think when PASID is added in and all the complexity that brings, it
> does become more important, yes.
>
> At the minimum we should scope the complexity.
>
> I'm not convinced it is so complicated, really it is just a single bit
> of information toward userspace: 'all devices in this group must use
> the same IOASID'

Um.. no? You could have devA and devB sharing a RID, but then also
sharing a group but not a RID with devC because of different isolation
issues. So you now have (at least) two levels of group structure to
expose somehow.
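
To illustrate why a single bit does not capture this, here is a minimal
sketch that models the devA/devB/devC case with two levels of sharing.
The names (dev_id, relation, REL_*) are hypothetical and only meant to
show the structure, not to suggest an interface.

```c
#include <assert.h>
#include <stddef.h>

struct dev_id {
    int group; /* isolation group the device sits in */
    int rid;   /* requester ID the IOMMU actually sees */
};

enum relation {
    REL_ISOLATED,   /* different groups */
    REL_SAME_GROUP, /* same group, distinguishable RIDs */
    REL_SAME_RID,   /* same group and indistinguishable to the IOMMU */
};

/* Classify how two devices relate; three outcomes, not one bit. */
static enum relation relation(const struct dev_id *a, const struct dev_id *b)
{
    assert(a != NULL && b != NULL);
    if (a->group != b->group)
        return REL_ISOLATED;
    return a->rid == b->rid ? REL_SAME_RID : REL_SAME_GROUP;
}
```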

>
> Something like qemu consumes this bit and creates the pci/pcie bridge
> to model this to the guest and so on.
>
> Something like dpdk just doesn't care (same as today).
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2021-06-24 04:56:14

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 18, 2021 at 12:15:06PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 18, 2021 at 03:47:51PM +0200, Joerg Roedel wrote:
> > Hi Kevin,
> >
> > On Thu, Jun 17, 2021 at 07:31:03AM +0000, Tian, Kevin wrote:
> > > Now let's talk about the new IOMMU behavior:
> > >
> > > - A device is blocked from doing DMA to any resource outside of
> > > its group when it's probed by the IOMMU driver. This could be a
> > > special state w/o attaching to any domain, or a new special domain
> > > type which differentiates it from existing domain types (identity,
> > > > dma, or unmanaged). Actually, existing code already includes an
> > > > IOMMU_DOMAIN_BLOCKED type but nobody uses it.
> >
> > There is a reason for the default domain to exist: Devices which require
> > RMRR mappings to be present. You can't just block all DMA from devices
> > until a driver takes over, we put much effort into making sure there is
> > not even a small window in time where RMRR regions (unity mapped regions
> > on AMD) are not mapped.
>
> Yes, I think the DMA blocking can only start around/after a VFIO type
> driver has probed() and bound to a device in the group, not much
> different from today.

But as I keep saying, some forms of grouping (and DMA aliasing as Alex
mentioned) mean that changing the domain of one device can change the
domain of another device, unavoidably. It may be rare with modern
hardware, but we still can't ignore the case.

Which means you can't DMA block until everything in the group is
managed by a vfio-like driver.
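
A minimal sketch of that precondition, taking the statement literally:
every device in the group must be owned by a vfio-like driver before the
group may be switched to DMA blocking. The names (owner,
group_can_block_dma) are hypothetical, and how driverless devices should
be treated is deliberately left out of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum owner {
    OWNER_KERNEL, /* bound to a regular kernel driver */
    OWNER_VFIO,   /* bound to a vfio-like driver */
};

/*
 * Changing the domain of one device may change it for every device in
 * the group, so a single kernel-owned member vetoes DMA blocking.
 */
static bool group_can_block_dma(const enum owner *owners, size_t n)
{
    assert(owners != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (owners[i] != OWNER_VFIO)
            return false;
    }
    return true;
}
```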

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2021-06-24 04:56:41

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 18, 2021 at 04:57:40PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, June 18, 2021 8:20 AM
> >
> > On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
> >
> > > I've referred to this as a limitation of type1, that we can't put
> > > devices within the same group into different address spaces, such as
> > > behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> > > As isolation support improves we see fewer multi-device groups, this
> > > scenario becomes the exception. Buy better hardware to use the devices
> > > independently.
> >
> > This is basically my thinking too, but my conclusion is that we should
> > not continue to make groups central to the API.
> >
> > As I've explained to David this is actually causing functional
> > problems and mess - and I don't see a clean way to keep groups central
> > but still have the device in control of what is happening. We need
> > this device <-> iommu connection to be direct to robustly model all
> > the things that are in the RFC.
> >
> > To keep groups central someone needs to sketch out how to solve
> > today's mdev SW page table and mdev PASID issues in a clean
> > way. Device centric is my suggestion on how to make it clean, but I
> > haven't heard an alternative??
> >
> > So, I view the purpose of this discussion to scope out what a
> > device-centric world looks like and then if we can securely fit in the
> > legacy non-isolated world on top of that clean future oriented
> > API. Then decide if it is work worth doing or not.
> >
> > To my mind it looks like it is not so bad, granted not every detail is
> > > clear, and no code has been sketched, but I don't see a big scary
> > blocker emerging. An extra ioctl or two, some special logic that
> > activates for >1 device groups that looks a lot like VFIO's current
> > logic..
> >
> > At some level I would be perfectly fine if we made the group FD part
> > of the API for >1 device groups - except that complexifies every user
> > space implementation to deal with that. It doesn't feel like a good
> > trade off.
> >
>
> Would it be an acceptable tradeoff to leave >1 device groups
> supported only via legacy VFIO (which is kept anyway for backward
> compatibility), if we think such a scenario is being deprecated over
> time (thus there is little value in adding new features to it)? Then all
> new sub-systems including vdpa and new vfio only support singleton
> device groups via /dev/iommu...

The case that worries me here is if you *thought* you had 1 device
groups, but then discover a hardware bug which means two things aren't
as isolated as you thought they were. What do you do then?

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2021-06-24 04:57:33

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 18, 2021 at 01:21:47PM +0800, Lu Baolu wrote:
> Hi David,
>
> On 6/17/21 1:22 PM, David Gibson wrote:
> > > The iommu_group can guarantee the isolation among different physical
> > > devices (represented by RIDs). But when it comes to sub-devices (ex. mdev or
> > > vDPA devices represented by RID + SSID), we have to rely on the
> > > device driver for isolation. The devices which are able to generate sub-
> > > devices should either use their own on-device mechanisms or use the
> > > platform features like Intel Scalable IOV to isolate the sub-devices.
> > This seems like a misunderstanding of groups. Groups are not tied to
> > any PCI meaning. Groups are the smallest unit of isolation, no matter
> > what is providing that isolation.
> >
> > If mdevs are isolated from each other by clever software, even though
> > they're on the same PCI device they are in different groups from each
> > > other *by definition*. They are also in a different group from their
> > parent device (however the mdevs only exist when mdev driver is
> > active, which implies that the parent device's group is owned by the
> > kernel).
>
>
> You are right. This is also my understanding of an "isolation group".
>
> But, as I understand it, iommu_group is only the isolation group visible
> to the IOMMU. When we talk about sub-devices (sw-mdev or mdev w/ pasid),
> only the device and its driver know the details of isolation, hence
> iommu_group could not be extended to cover them. The device drivers
> should define their own isolation groups.

So, "iommu group" isn't a perfect name. It came about because
originally the main mechanism for isolation was the IOMMU, so it was
typically the IOMMU's capabilities that determined if devices were
isolated. However it was always known that there could be other
reasons for failure of isolation. To simplify the model we decided
that we'd put things into the same group if they were non-isolated for
any reason.

The kernel has no notion of "isolation group" as distinct from "iommu
group". What are called iommu groups in the kernel now *are*
"isolation groups" and that was always the intention - it's just not a
great name.

> Otherwise, the device driver has to fake an iommu_group and add hacky
> code to link the related IOMMU elements (iommu device, domain, group
> etc.) together. Actually this is part of the problem that this proposal
> tries to solve.

Yeah, that's not ideal.

> > > Under the above conditions, different sub-devices of the same RID device
> > > could use different IOASIDs. This seems to mean that we can't
> > > support mixed mode where, for example, two RIDs share an iommu_group and
> > > one (or both) of them have sub-devices.
> > That doesn't necessarily follow. mdevs which can be successfully
> > isolated by their mdev driver are in a different group from their
> > parent device, and therefore need not be affected by whether the
> > parent device shares a group with some other physical device. They
> > *might* be, but that's up to the mdev driver to determine based on
> > what it can safely isolate.
> >
>
> If we understand it as multiple levels of isolation, can we classify the
> devices into the following categories?
>
> 1) Legacy devices
> - devices without device-level isolation
> - multiple devices could sit in a single iommu_group
> - only a single I/O address space could be bound to IOMMU

I'm not really clear on what that last statement means.

> 2) Modern devices
> - devices capable of device-level isolation

This will *typically* be true of modern devices, but I don't think we
can really make it a hard API distinction. Legacy or buggy bridges
can force modern devices into the same group as each other. Modern
devices are not immune from bugs which would force lack of isolation
(e.g. forgotten debug registers on function 0 which affect other
functions).

> - able to have subdevices
> - self-isolated, hence do not share an iommu_group with others
> - multiple I/O address spaces could be bound to IOMMU
>
> For 1), all devices in an iommu_group should be bound to a single
> IOASID; The isolation is guaranteed by an iommu_group.
>
> For 2) a single device could be bound to multiple IOASIDs with each sub-
> device corresponding to an IOASID. The isolation of each subdevice is
> guaranteed by the device driver.
>
> Best regards,
> baolu
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2021-06-24 04:57:43

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
> On Thu, 17 Jun 2021 07:31:03 +0000
> "Tian, Kevin" <[email protected]> wrote:
> > > From: Alex Williamson <[email protected]>
> > > Sent: Thursday, June 17, 2021 3:40 AM
> > > On Wed, 16 Jun 2021 06:43:23 +0000
> > > "Tian, Kevin" <[email protected]> wrote:
> > > > > From: Alex Williamson <[email protected]>
> > > > > Sent: Wednesday, June 16, 2021 12:12 AM
> > > > > On Tue, 15 Jun 2021 02:31:39 +0000
> > > > > "Tian, Kevin" <[email protected]> wrote:
> > > > > > > From: Alex Williamson <[email protected]>
> > > > > > > Sent: Tuesday, June 15, 2021 12:28 AM
[snip]

> > > > > 3) A dual-function conventional PCI e1000 NIC where the functions are
> > > > > grouped together due to shared RID.
> > > > >
> > > > > a) Repeat 2.a) and 2.b) such that we have valid, user-accessible
> > > > > devices in the same IOMMU context.
> > > > >
> > > > > b) Function 1 is detached from the IOASID.
> > > > >
> > > > > I think function 1 cannot be placed into a different IOMMU context
> > > > > here, does the detach work? What's the IOMMU context now?
> > > >
> > > > Yes. Function 1 is back to block-DMA. Since both functions share RID,
> > > > essentially it implies function 0 is in block-DMA state too (though its
> > > > tracking state may not change yet) since the shared IOMMU context
> > > > entry blocks DMA now. In IOMMU fd function 0 is still attached to the
> > > > IOASID thus the user still needs to do an explicit detach to clear the
> > > > tracking state for function 0.
> > > >
> > > > >
> > > > > c) A new IOASID is alloc'd within the existing iommu_fd and function
> > > > > 1 is attached to the new IOASID.
> > > > >
> > > > > Where, how, by whom does this fail?
> > > >
> > > > No need to fail. It can succeed since doing so just hurts user's own foot.
> > > >
> > > > The only question is how user knows the fact that a group of devices
> > > > share RID thus avoid such thing. I'm curious how it is communicated
> > > > with today's VFIO mechanism. Yes the group-centric VFIO uAPI prevents
> > > > a group of devices from attaching to multiple IOMMU contexts, but
> > > > suppose we still need a way to tell the user to not do so. Especially
> > > > such knowledge would be also reflected in the virtual PCI topology
> > > > when the entire group is assigned to the guest which needs to know
> > > > this fact when vIOMMU is exposed. I haven't found time to investigate
> > > > it but suppose if such channel exists it could be reused, or in the worst
> > > > case we may have the new device capability interface to convey...
> > >
> > > No such channel currently exists, it's not an issue today, IOMMU
> > > context is group-based.
> >
> > Interesting... If such group of devices are assigned to a guest, how does
> > Qemu decide the virtual PCI topology for them? Do they have same
> > vRID or different?
>
> That's the beauty of it, it doesn't matter how many RIDs exist in the
> group, or which devices have aliases, the group is the minimum
> granularity of a container where QEMU knows that a container provides
> a single address space. Therefore a container must exist in a single
> address space in the PCI topology. In a conventional or non-vIOMMU
> topology, the PCI address space is equivalent to the system memory
> address space. When vIOMMU gets involved, multiple devices within the
> same group must exist in the same address space. A vPCIe-to-PCI bridge
> can be used to create that shared address space.
>
> I've referred to this as a limitation of type1, that we can't put
> devices within the same group into different address spaces, such as
> behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> As isolation support improves we see fewer multi-device groups, this
> scenario becomes the exception. Buy better hardware to use the devices
> independently.

Also, that limitation is fundamental. Groups in a guest must always
be the same or strictly bigger than groups in the host, because if the
real hardware can't isolate them, then the virtual hardware certainly
can't and the guest kernel shouldn't be given the impression that it
can separate them.
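
To make that rule concrete, here is a minimal sketch of the check it implies. The function name and the flat arrays of group IDs are assumptions made purely for illustration; nothing like this exists in the kernel or QEMU in this form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check, not an existing kernel or QEMU function.
 * host_group[i] and guest_group[i] are the group IDs of device i on the
 * host and in the guest topology. The guest layout is acceptable only if
 * devices sharing a host group also share a guest group, i.e. every
 * guest group is a union of whole host groups.
 */
bool guest_groups_cover_host(const int *host_group,
                             const int *guest_group, size_t ndev)
{
    assert(host_group && guest_group);

    for (size_t i = 0; i < ndev; i++) {
        for (size_t j = i + 1; j < ndev; j++) {
            if (host_group[i] == host_group[j] &&
                guest_group[i] != guest_group[j])
                return false; /* guest would claim isolation the host lacks */
        }
    }
    return true;
}
```

Merging host groups in the guest passes this check, while splitting a host group across guest groups fails it.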

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-24 04:58:10

by David Gibson

Subject: Re: Plan for /dev/ioasid RFC v2

On Tue, Jun 15, 2021 at 01:21:35AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Monday, June 14, 2021 9:38 PM
> >
> > On Mon, Jun 14, 2021 at 03:09:31AM +0000, Tian, Kevin wrote:
> >
> > > If a device can be always blocked from accessing memory in the IOMMU
> > > before it's bound to a driver or more specifically before the driver
> > > moves it to a new security context, then there is no need for VFIO
> > > to track whether IOASIDfd has taken over ownership of the DMA
> > > context for all devices within a group.
> >
> > I've been assuming we'd do something like this, where when a device is
> > first turned into a VFIO it tells the IOMMU layer that this device
> > should be DMA blocked unless an IOASID is attached to
> > it. Disconnecting an IOASID returns it to blocked.
>
> Or just make sure a device is in block-DMA when it's unbound from a
> driver or a security context.

So I'm not entirely clear here if you're envisaging putting the device
into no-DMA mode by altering the IOMMU setup or by quiescing it at the
register level (e.g. by resetting it). But, neither approach allows
you to safely put a device into no-DMA mode if users have access to
another device in the group.

The IOMMU approach doesn't work, because the IOMMU may not be able to
distinguish the two devices from each other.

The register approach doesn't work, because even if you successfully
quiesce the device, the user could poke it indirectly via the other
device in the group, pulling it out of quiescent mode.

> Then no need to explicitly tell IOMMU layer
> to do so when it's bound to a new driver.
>
> Currently the default domain type applies even when a device is not
> bound. This implies that if iommu=passthrough a device is always
> allowed to access arbitrary system memory with or without a driver.
> I feel the current domain type (identity, dma, unmanaged) should apply
> only when a driver is loaded...

A whole group has to be in the same DMA context at the same time.
That's the definition of a group.

> > > If this works I didn't see the need for vfio to keep the sequence.
> > > VFIO still keeps group fd to claim ownership of all devices in a
> > > group.
> >
> > As Alex says you still have to deal with the problem that device A in
> > a group can gain control of device B in the same group.
>
> There is no isolation in the group then how could vfio prevent device
> A from gaining control of device B? for example when both are attached
> to the same GPA address space with device MMIO bar included, devA
> can do p2p to devB. It's all user's policy how to deal with devices within
> the group.
>
> >
> > This means device A and B cannot be used from two different
> > security contexts.
>
> It depends on how the security context is defined. From iommu layer
> p.o.v, an IOASID is a security context which isolates a device from
> the rest of the system (but not the sibling in the same group). As you
> suggested earlier, it's completely sane if an user wants to attach
> devices in a group to different IOASIDs. Here I just talk about this fact.
>
> >
> > If the /dev/iommu FD is the security context then the tracking is
> > needed there.
> >
>
> As I replied to Alex, my point is that VFIO doesn't need to know the
> attaching status of each device in a group before it can allow user to
> access a device. As long as a device in a group is either in block DMA
> or switched to a new address space created via /dev/iommu FD, there's
> no problem to allow user accessing it. User cannot do harm to the
> world outside of the group. User knows there is no isolation within
> the group. That is it.
>
> Thanks
> Kevin
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-24 06:01:36

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: David Gibson
> Sent: Thursday, June 24, 2021 12:26 PM
>
> On Fri, Jun 18, 2021 at 04:57:40PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Friday, June 18, 2021 8:20 AM
> > >
> > > On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
> > >
> > > > I've referred to this as a limitation of type1, that we can't put
> > > > devices within the same group into different address spaces, such as
> > > > behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> > > > As isolation support improves we see fewer multi-device groups, this
> > > > > scenario becomes the exception. Buy better hardware to use the devices
> > > > > independently.
> > >
> > > This is basically my thinking too, but my conclusion is that we should
> > > not continue to make groups central to the API.
> > >
> > > As I've explained to David this is actually causing functional
> > > problems and mess - and I don't see a clean way to keep groups central
> > > but still have the device in control of what is happening. We need
> > > this device <-> iommu connection to be direct to robustly model all
> > > the things that are in the RFC.
> > >
> > > To keep groups central someone needs to sketch out how to solve
> > > today's mdev SW page table and mdev PASID issues in a clean
> > > way. Device centric is my suggestion on how to make it clean, but I
> > > haven't heard an alternative??
> > >
> > > So, I view the purpose of this discussion to scope out what a
> > > device-centric world looks like and then if we can securely fit in the
> > > legacy non-isolated world on top of that clean future oriented
> > > API. Then decide if it is work worth doing or not.
> > >
> > > To my mind it looks like it is not so bad, granted not every detail is
> > > > clear, and no code has been sketched, but I don't see a big scary
> > > blocker emerging. An extra ioctl or two, some special logic that
> > > activates for >1 device groups that looks a lot like VFIO's current
> > > logic..
> > >
> > > At some level I would be perfectly fine if we made the group FD part
> > > of the API for >1 device groups - except that complexifies every user
> > > space implementation to deal with that. It doesn't feel like a good
> > > trade off.
> > >
> >
> > Would it be an acceptable tradeoff by leaving >1 device groups
> > supported only via legacy VFIO (which is anyway kept for backward
> > compatibility), if we think such scenario is being deprecated over
> > time (thus little value to add new features on it)? Then all new
> > sub-systems including vdpa and new vfio only support singleton
> > device group via /dev/iommu...
>
> The case that worries me here is if you *thought* you had 1 device
> groups, but then discover a hardware bug which means two things aren't
> as isolated as you thought they were. What do you do then?

I didn't get your point. If such hardware bug leaves two associated
devices in separate groups, what can software do? Even with existing
VFIO mechanism they can be attached to different containers before
the bug is identified since the kernel thinks they are isolated. If the
after-the-fact mitigation is to kill the VM and then force the two devices
to attach to a single VFIO container (after such hardware bug is identified),
the same mitigation can be applied here i.e. the user should fall back to
legacy VFIO instead of attempting to use new /dev/iommu for such
devices...

Thanks
Kevin

2021-06-24 11:59:25

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 24, 2021 at 02:37:31PM +1000, David Gibson wrote:
> On Thu, Jun 17, 2021 at 08:04:38PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jun 17, 2021 at 03:02:33PM +1000, David Gibson wrote:
> >
> > > In other words, do we really have use cases where we need to identify
> > > different devices IDs, even though we know they're not isolated.
> >
> > I think when PASID is added in and all the complexity that brings, it
> > does become more important, yes.
> >
> > At the minimum we should scope the complexity.
> >
> > I'm not convinced it is so complicated, really it is just a single bit
> > of information toward userspace: 'all devices in this group must use
> > the same IOASID'
>
> Um.. no? You could have devA and devB sharing a RID, but then also
> sharing a group but not a RID with devC because of different isolation
> issues. So you now have (at least) two levels of group structure to
> expose somehow.

Why? I don't need to micro optimize for broken systems. a/b/c can be
in the same group and the group can have the bit set.

It is no worse than what we have today.

Jason

2021-06-24 11:59:51

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Thu, Jun 24, 2021 at 02:29:29PM +1000, David Gibson wrote:

> But as I keep saying, some forms of grouping (and DMA aliasing as Alex
> mentioned) mean that changing the domain of one device can change the
> domain of another device, unavoidably. It may be rare with modern
> hardware, but we still can't ignore the case.
>
> Which means you can't DMA block until everything in the group is
> managed by a vfio-like driver.

We just need the same restriction as today, the group fd will attach a
domain under quite a wide range of conditions, and we can copy that.

IIRC it is not a requirement today that every device in the group have a
vfio driver bound to it?

Jason

2021-06-24 12:26:16

by Lu Baolu

Subject: Re: Plan for /dev/ioasid RFC v2

On 2021/6/24 12:26, David Gibson wrote:
> On Fri, Jun 18, 2021 at 04:57:40PM +0000, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <[email protected]>
>>> Sent: Friday, June 18, 2021 8:20 AM
>>>
>>> On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
>>>
>>>> I've referred to this as a limitation of type1, that we can't put
>>>> devices within the same group into different address spaces, such as
>>>> behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
>>>> As isolation support improves we see fewer multi-device groups, this
>>>> scenario becomes the exception. Buy better hardware to use the devices
>>>> independently.
>>>
>>> This is basically my thinking too, but my conclusion is that we should
>>> not continue to make groups central to the API.
>>>
>>> As I've explained to David this is actually causing functional
>>> problems and mess - and I don't see a clean way to keep groups central
>>> but still have the device in control of what is happening. We need
>>> this device <-> iommu connection to be direct to robustly model all
>>> the things that are in the RFC.
>>>
>>> To keep groups central someone needs to sketch out how to solve
>>> today's mdev SW page table and mdev PASID issues in a clean
>>> way. Device centric is my suggestion on how to make it clean, but I
>>> haven't heard an alternative??
>>>
>>> So, I view the purpose of this discussion to scope out what a
>>> device-centric world looks like and then if we can securely fit in the
>>> legacy non-isolated world on top of that clean future oriented
>>> API. Then decide if it is work worth doing or not.
>>>
>>> To my mind it looks like it is not so bad, granted not every detail is
>>> clear, and no code has been sketched, but I don't see a big scary
>>> blocker emerging. An extra ioctl or two, some special logic that
>>> activates for >1 device groups that looks a lot like VFIO's current
>>> logic..
>>>
>>> At some level I would be perfectly fine if we made the group FD part
>>> of the API for >1 device groups - except that complexifies every user
>>> space implementation to deal with that. It doesn't feel like a good
>>> trade off.
>>>
>>
>> Would it be an acceptable tradeoff by leaving >1 device groups
>> supported only via legacy VFIO (which is anyway kept for backward
>> compatibility), if we think such scenario is being deprecated over
>> time (thus little value to add new features on it)? Then all new
>> sub-systems including vdpa and new vfio only support singleton
>> device group via /dev/iommu...
>
> The case that worries me here is if you *thought* you had 1 device
> groups, but then discover a hardware bug which means two things aren't
> as isolated as you thought they were. What do you do then?
>

Normally a hardware bug/quirk is identified during boot. For above case,
iommu core should put these two devices in a same iommu_group during
iommu_probe_device() phase. Any runtime hardware bug should be reported
to the OS through various methods so that the device could be quiesced
and isolated. I don't think two devices could be in different groups
initially and then be moved to a single one.

Best regards,
baolu

2021-06-24 13:43:59

by Lu Baolu

Subject: Re: Plan for /dev/ioasid RFC v2

On 2021/6/24 12:03, David Gibson wrote:
> On Fri, Jun 18, 2021 at 01:21:47PM +0800, Lu Baolu wrote:
>> Hi David,
>>
>> On 6/17/21 1:22 PM, David Gibson wrote:
>>>> The iommu_group can guarantee the isolation among different physical
>>>> devices (represented by RIDs). But when it comes to sub-devices (ex. mdev or
>>>> vDPA devices represented by RID + SSID), we have to rely on the
>>>> device driver for isolation. The devices which are able to generate sub-
>>>> devices should either use their own on-device mechanisms or use the
>>>> platform features like Intel Scalable IOV to isolate the sub-devices.
>>> This seems like a misunderstanding of groups. Groups are not tied to
>>> any PCI meaning. Groups are the smallest unit of isolation, no matter
>>> what is providing that isolation.
>>>
>>> If mdevs are isolated from each other by clever software, even though
>>> they're on the same PCI device they are in different groups from each
>>> other *by definition*. They are also in a different group from their
>>> parent device (however the mdevs only exist when mdev driver is
>>> active, which implies that the parent device's group is owned by the
>>> kernel).
>>
>> You are right. This is also my understanding of an "isolation group".
>>
>> But, as I understand it, iommu_group is only the isolation group visible
>> to IOMMU. When we talk about sub-devices (sw-mdev or mdev w/ pasid),
>> only the device and device driver knows the details of isolation, hence
>> iommu_group could not be extended to cover them. The device drivers
>> should define their own isolation groups.
> So, "iommu group" isn't a perfect name. It came about because
> originally the main mechanism for isolation was the IOMMU, so it was
> typically the IOMMU's capabilities that determined if devices were
> isolated. However it was always known that there could be other
> reasons for failure of isolation. To simplify the model we decided
> that we'd put things into the same group if they were non-isolated for
> any reason.

Yes.

>
> The kernel has no notion of "isolation group" as distinct from "iommu
> group". What are called iommu groups in the kernel now*are*
> "isolation groups" and that was always the intention - it's just not a
> great name.

Fair enough.

>
>> Otherwise, the device driver has to fake an iommu_group and add hacky
>> code to link the related IOMMU elements (iommu device, domain, group
>> etc.) together. Actually this is part of the problem that this proposal
>> tries to solve.
> Yeah, that's not ideal.
>
>>>> Under above conditions, different sub-device from a same RID device
>>>> could be able to use different IOASID. This seems to mean that we can't
>>>> support mixed mode where, for example, two RIDs share an iommu_group and
>>>> one (or both) of them have sub-devices.
>>> That doesn't necessarily follow. mdevs which can be successfully
>>> isolated by their mdev driver are in a different group from their
>>> parent device, and therefore need not be affected by whether the
>>> parent device shares a group with some other physical device. They
>>> *might* be, but that's up to the mdev driver to determine based on
>>> what it can safely isolate.
>>>
>> If we understand it as multiple levels of isolation, can we classify the
>> devices into the following categories?
>>
>> 1) Legacy devices
>> - devices without device-level isolation
>> - multiple devices could sit in a single iommu_group
>> - only a single I/O address space could be bound to IOMMU
> I'm not really clear on what that last statement means.

I mean a single iommu_domain should be used by all devices sharing a
single iommu_group.

>
>> 2) Modern devices
>> - devices capable of device-level isolation
> This will *typically* be true of modern devices, but I don't think we
> can really make it a hard API distinction. Legacy or buggy bridges
> can force modern devices into the same group as each other. Modern
> devices are not immune from bugs which would force lack of isolation
> (e.g. forgotten debug registers on function 0 which affect other
> functions).
>

Yes.

I am thinking whether it's feasible to change "bind/attach a device to
an IOASID" to "bind/attach an isolated unit to an IOASID". An isolated
unit could be

1) an iommu_group including single or multiple devices;
2) a physical device which has a 1-device iommu group + device ID
(PASID/subStreamID) which represents an isolated subdevice inside the
physical one.
3) anything that we might have in the future.

A handle which represents the connection between device and iommu is
returned on any successful binding. This handle could be used to
GET_INFO and attach/detach after binding.
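
A rough sketch of what such an "isolated unit" and its handle could look like is below. Every type and function name here is hypothetical and invented for illustration only; none of it is existing kernel API, and case 3) is simply left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names are hypothetical; this is not existing kernel API. */
enum iso_unit_type {
    ISO_UNIT_GROUP,     /* 1) an iommu_group with one or more devices */
    ISO_UNIT_SUBDEVICE, /* 2) singleton-group device + PASID/subStreamID */
};

struct iso_unit {
    enum iso_unit_type type;
    int group_id;       /* iommu_group the unit lives in */
    int ndevices;       /* number of devices in that group */
    uint32_t pasid;     /* only meaningful for ISO_UNIT_SUBDEVICE */
};

/* Returned by a successful bind; used later for GET_INFO/attach/detach. */
struct iso_handle {
    struct iso_unit unit;
    int ioasid;         /* -1 while not attached */
};

bool iso_unit_valid(const struct iso_unit *u)
{
    if (u->ndevices < 1)
        return false;
    if (u->type == ISO_UNIT_SUBDEVICE)
        return u->ndevices == 1; /* sub-devices need a 1-device group */
    return u->type == ISO_UNIT_GROUP;
}

/* Bind an isolated unit; fills in the handle on success. */
int iso_bind(const struct iso_unit *u, struct iso_handle *h)
{
    if (!iso_unit_valid(u))
        return -1;
    h->unit = *u;
    h->ioasid = -1;
    return 0;
}

/* Attach the bound unit to an IOASID; one IOASID per unit at a time. */
int iso_attach(struct iso_handle *h, int ioasid)
{
    assert(ioasid >= 0);
    if (h->ioasid != -1)
        return -1;
    h->ioasid = ioasid;
    return 0;
}
```

The point of the sketch is only that the bind call validates the unit as a whole, so a sub-device of a device in a multi-device group is rejected up front.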

Best regards,
baolu

2021-06-25 10:30:00

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

Hi, Alex/Joerg/Jason,

Want to draw your attention to an updated proposal below. Let's see
whether there is a converged direction to move forward.

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, June 19, 2021 2:23 AM
>
> On Fri, Jun 18, 2021 at 04:57:40PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Friday, June 18, 2021 8:20 AM
> > >
> > > On Thu, Jun 17, 2021 at 03:14:52PM -0600, Alex Williamson wrote:
> > >
> > > > I've referred to this as a limitation of type1, that we can't put
> > > > devices within the same group into different address spaces, such as
> > > > behind separate vRoot-Ports in a vIOMMU config, but really, who cares?
> > > > As isolation support improves we see fewer multi-device groups, this
> > > > > scenario becomes the exception. Buy better hardware to use the devices
> > > > > independently.
> > >
> > > This is basically my thinking too, but my conclusion is that we should
> > > not continue to make groups central to the API.
> > >
> > > As I've explained to David this is actually causing functional
> > > problems and mess - and I don't see a clean way to keep groups central
> > > but still have the device in control of what is happening. We need
> > > this device <-> iommu connection to be direct to robustly model all
> > > the things that are in the RFC.
> > >
> > > To keep groups central someone needs to sketch out how to solve
> > > today's mdev SW page table and mdev PASID issues in a clean
> > > way. Device centric is my suggestion on how to make it clean, but I
> > > haven't heard an alternative??
> > >
> > > So, I view the purpose of this discussion to scope out what a
> > > device-centric world looks like and then if we can securely fit in the
> > > legacy non-isolated world on top of that clean future oriented
> > > API. Then decide if it is work worth doing or not.
> > >
> > > To my mind it looks like it is not so bad, granted not every detail is
> > > > clear, and no code has been sketched, but I don't see a big scary
> > > blocker emerging. An extra ioctl or two, some special logic that
> > > activates for >1 device groups that looks a lot like VFIO's current
> > > logic..
> > >
> > > At some level I would be perfectly fine if we made the group FD part
> > > of the API for >1 device groups - except that complexifies every user
> > > space implementation to deal with that. It doesn't feel like a good
> > > trade off.
> > >
> >
> > Would it be an acceptable tradeoff by leaving >1 device groups
> > supported only via legacy VFIO (which is anyway kept for backward
> > compatibility), if we think such scenario is being deprecated over
> > time (thus little value to add new features on it)? Then all new
> > sub-systems including vdpa and new vfio only support singleton
> > device group via /dev/iommu...
>
> That might just be a great idea - userspace has to support those APIs
> anyhow, if it can be made trivially obvious to use this fallback even
> though /dev/iommu is available it is a great place to start. It also
> means PASID/etc are naturally blocked off.
>
> Maybe years down the road we will want to harmonize them, so I would
> still sketch it out enough to be confident it could be implemented..
>

First let's align on the high level goal of supporting multi-devices group
via IOMMU fd. Based on previous discussions I feel it's fair to say that
we will not provide new features beyond what vfio group delivers today,
which implies:

1) All devices within the group must share the same address space.

Though it's possible to support multiple address spaces (e.g. if caused
by !ACS), there are some scenarios (DMA aliasing, RID sharing, etc.)
where a single address space is mandatory. The effort to support
multiple spaces is not worthwhile due to improved isolation over time.

2) It's not necessary to bind all devices within the group to the IOMMU fd.

Other devices could be left unused, or bound to a known driver which
doesn't do DMA. This implies a group viability mechanism must be in
place which can identify when the group is viable for operation and
BUG_ON() when the viability is changed due to user action.

3) User must be denied access to a device before its group is attached
to a known security context.

If above goals are agreed, below is the updated proposal for supporting
multi-devices group via device-centric API. Most ideas come from Jason.
Here I try to expand and compose them into a full picture.

In general:

- vfio keeps existing uAPI sequence, with slightly different semantics:

a) VFIO_GROUP_SET_CONTAINER, as today

b) VFIO_SET_IOMMU with a new iommu type (VFIO_EXTERNAL_IOMMU)
which, once set, tells VFIO not to establish its own security
context.

c) VFIO_GROUP_GET_DEVICE_FD_NEW, carrying additional info
about external iommu driver (iommu_fd, device_cookie). This
call automatically binds the device to iommu_fd. Device fd is
returned to the user only after successful binding which implies
a security context (BLOCK_DMA) has been established for the
entire group. Since the security context is managed by iommu_fd,
group viable check should be done in the iommu layer thus
vfio_group_viable() mechanism is redundant in this case.

- When receiving the binding call for the 1st device in a group, iommu_fd
calls iommu_group_set_block_dma(group, dev->driver) which does
several things:

a) Check group viability. A group is viable only when all devices in
the group are in one of below states:

* driver-less
* bound to a driver which is same as dev->driver (vfio in this case)
* bound to an otherwise allowed driver (same list as in vfio)

b) Set block_dma flag for the group and configure the IOMMU to block
DMA for all devices in this group. This could be done by attaching to
a dedicated iommu domain (IOMMU_DOMAIN_BLOCKED) which has
an empty page table.

c) The iommu layer also verifies group viability on the
BUS_NOTIFY_BOUND_DRIVER event. BUG_ON if viability is broken while
block_dma is set.

- Binding other devices in the group to iommu_fd just succeeds since
the group is already in block_dma.

- When a group is in block_dma state, all devices in the group (even those not
bound to iommu_fd) switch together between blocked domain and
IOASID domain, initiated by attaching to or detaching from an IOASID.

a) iommu_fd verifies that all bound devices in the same group must be
attached to a single IOASID.

b) the 1st device attach in the group calls iommu API to move the
entire group to use the new IOASID domain.

c) the last device detach calls iommu API to move the entire group
back to the blocked domain.

- A device is allowed to be unbound from iommu_fd when other devices
in the group are still bound. In this case the group is still in block_dma
state thus the unbound device should not be bound to another driver
which could break the group viability.

a) for vfio this unbind is automatically done when device fd is closed.

- When vfio requests to unbind the last device in the group, iommu_fd
calls iommu_group_unset_block_dma(group) to move the group out
of the block_dma state. Devices in the group are re-attached to the
default domain from now on.

With this design all the helper functions and uAPI are kept device-centric
in iommu_fd. It maintains minimal group knowledge internally by tracking
device binding/attaching status within each group and then calling proper
iommu API upon changed group status.
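
As a rough illustration of that tracking, the per-group state could be modelled as below. This is only a sketch under assumptions: the struct, enum and function names are made up for this mail, per-device state is reduced to two counters, and the real iommu API calls are replaced by a plain state change.

```c
#include <assert.h>

/* Hypothetical model of the per-group state iommu_fd would track. */
enum grp_domain { GRP_DEFAULT, GRP_BLOCKED, GRP_IOASID };

struct grp_state {
    enum grp_domain domain;
    int nbound;    /* devices in the group bound to iommu_fd */
    int nattached; /* bound devices currently attached to an IOASID */
    int ioasid;    /* valid only while domain == GRP_IOASID */
};

/* The first bind in a group moves the whole group to the blocked domain. */
void grp_bind(struct grp_state *g)
{
    if (g->nbound++ == 0)
        g->domain = GRP_BLOCKED;
}

/* The last unbind returns the group to the default domain. */
void grp_unbind(struct grp_state *g)
{
    assert(g->nbound > g->nattached); /* caller must detach first */
    if (--g->nbound == 0)
        g->domain = GRP_DEFAULT;
}

/* All bound devices in a group must attach to a single IOASID. */
int grp_attach(struct grp_state *g, int ioasid)
{
    if (g->nattached == g->nbound)
        return -1; /* no bound-but-detached device left */
    if (g->nattached > 0 && g->ioasid != ioasid)
        return -1; /* group is already on a different IOASID */
    if (g->nattached++ == 0) {
        g->domain = GRP_IOASID; /* first attach moves the entire group */
        g->ioasid = ioasid;
    }
    return 0;
}

/* The last detach moves the entire group back to the blocked domain. */
void grp_detach(struct grp_state *g)
{
    assert(g->nattached > 0);
    if (--g->nattached == 0)
        g->domain = GRP_BLOCKED;
}
```

The uAPI stays per-device; the group only shows up as this small piece of internal bookkeeping.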

VFIO still keeps its container/group/device semantics for backward
compatibility.

A new subsystem can completely eliminate group semantics as long as
it could find a way to finish device binding before granting the user
access to the device.

Thanks
Kevin

2021-06-25 14:39:02

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:

> - When receiving the binding call for the 1st device in a group, iommu_fd
> calls iommu_group_set_block_dma(group, dev->driver) which does
> several things:

The whole problem here is trying to match this new world where we want
devices to be in charge of their own IOMMU configuration and the old
world where groups are in charge.

Inserting the group fd and then calling a device-centric
VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
necessary. We can always get the group back from the device at any
point in the sequence to do a group-wide operation.

What I saw as the appeal of the sort of idea was to just completely
leave all the difficult multi-device-group scenarios behind on the old
group centric API and then we don't have to deal with them at all, or
at least not right away.

I'd see some progression where iommu_fd only works with 1:1 groups at
the start. Other scenarios continue with the old API.

Then maybe groups where all devices use the same IOASID.

Then 1:N groups if the source device is reliably identifiable, this
requires iommu subsystem work to attach domains to sub-group objects -
not sure it is worthwhile.

But at least we can talk about each step with well thought out patches

The only thing that needs to be done to get the 1:1 step is to broadly
define how the other two cases will work so we don't get into trouble
and set some way to exclude the problematic cases from even getting to
iommu_fd in the first place.

For instance if we go ahead and create /dev/vfio/device nodes we could
do this only if the group was 1:1, otherwise the group cdev has to be
used, along with its API.

> a) Check group viability. A group is viable only when all devices in
> the group are in one of below states:
>
> * driver-less
> * bound to a driver which is same as dev->driver (vfio in this case)
> * bound to an otherwise allowed driver (same list as in vfio)

This really shouldn't use hardwired driver checks. Attached drivers
should generically indicate to the iommu layer that they are safe for
iommu_fd usage by calling some function around probe()

Thus a group must contain only iommu_fd safe drivers, or driver-less
devices before any of it can be used. It is the more general
refactoring of what VFIO is doing.

> c) The iommu layer also verifies group viability on BUS_NOTIFY_
> BOUND_DRIVER event. BUG_ON if viability is broken while block_dma
> is set.

And with this concept of iommu_fd safety being first-class maybe we
can somehow eliminate this gross BUG_ON (and the 100's of lines of
code that are used to create it) by denying probe to non-iommu-safe
drivers, somehow.
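
Very roughly, and only as a sketch with made-up names (the structs below are stand-ins, not the real struct device or struct device_driver), the opt-in and the probe-time denial could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structures look different. */
struct drv {
    const char *name;
    bool iommu_fd_safe;
};

struct dev {
    const struct drv *driver; /* NULL when driver-less */
};

/* What a driver might call around probe() to declare itself safe. */
void drv_mark_iommu_fd_safe(struct drv *d)
{
    d->iommu_fd_safe = true;
}

/* A group is usable only if every device is driver-less or safe. */
bool group_usable(const struct dev *devs, size_t n)
{
    assert(devs || n == 0);

    for (size_t i = 0; i < n; i++) {
        const struct drv *d = devs[i].driver;

        if (d && !d->iommu_fd_safe)
            return false;
    }
    return true;
}

/* Deny probe of an unsafe driver while the group is in use by iommu_fd. */
bool may_probe(const struct drv *d, bool group_in_use)
{
    return !group_in_use || d->iommu_fd_safe;
}
```

With something like may_probe() consulted by the driver core, an unsafe driver never gets bound while the group is in use, so there is nothing left to BUG_ON about.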

> - Binding other devices in the group to iommu_fd just succeeds since
> the group is already in block_dma.

I think the rest of this more or less describes the device centric
logic for multi-device groups we've already talked about. I don't
think it benefits from having the group fd.

Jason

2021-06-28 01:11:32

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, June 25, 2021 10:36 PM
>
> On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:
>
> > - When receiving the binding call for the 1st device in a group, iommu_fd
> > calls iommu_group_set_block_dma(group, dev->driver) which does
> > several things:
>
> The whole problem here is trying to match this new world where we want
> devices to be in charge of their own IOMMU configuration and the old
> world where groups are in charge.
>
> Inserting the group fd and then calling a device-centric
> VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> necessary.

No, this was not what I meant. There is no group fd required when
calling this device-centric interface. I was actually talking about:

iommu_group_set_block_dma(dev->group, dev->driver)

just because current iommu layer API is group-centric. Whether this
should be improved could be next-level thing. Sorry for not making
it clear in the first place.

> We can always get the group back from the device at any
> point in the sequence to do a group-wide operation.

yes.

>
> What I saw as the appeal of the sort of idea was to just completely
> leave all the difficult multi-device-group scenarios behind on the old
> group centric API and then we don't have to deal with them at all, or
> least not right away.

Yes, this is the staged approach that we discussed earlier. The
reason I refined this proposal for multi-device groups here is
that you wanted to see some confidence in this direction. Thus I
expanded your idea, hoping to reach consensus with Alex/Joerg,
who obviously have not been convinced yet.

>
> I'd see some progression where iommu_fd only works with 1:1 groups at
> the start. Other scenarios continue with the old API.

One uAPI question remains open after completing this new sketch. v1
proposed to conduct binding (VFIO_BIND_IOMMU_FD) after device_fd is acquired.
With this sketch we need a new VFIO_GROUP_GET_DEVICE_FD_NEW
to complete both in one step. I want to get Alex's confirmation on whether
this sounds good to him, since it's better to unify the uAPI between 1:1
groups and 1:N groups even if we don't support 1:N at the start.

>
> Then maybe groups where all devices use the same IOASID.
>
> Then 1:N groups if the source device is reliably identifiable, this
> requires iommu subystem work to attach domains to sub-group objects -
> not sure it is worthwhile.
>
> But at least we can talk about each step with well thought out patches
>
> The only thing that needs to be done to get the 1:1 step is to broadly
> define how the other two cases will work so we don't get into trouble
> and set some way to exclude the problematic cases from even getting to
> iommu_fd in the first place.
>
> For instance if we go ahead and create /dev/vfio/device nodes we could
> do this only if the group was 1:1, otherwise the group cdev has to be
> used, along with its API.

I feel for VFIO possibly we don't need significant change to its uAPI
sequence, since it anyway needs to support existing semantics for
backward compatibility. With this sketch we can keep vfio container/
group by introducing an external iommu type which implies a different
GET_DEVICE_FD semantics. /dev/iommu can report a fd-wide capability
for whether 1:N group is supported to vfio user.

For new subsystems they can directly create device nodes and rely on
iommu fd to manage group isolation, without introducing any group
semantics in its uAPI.

>
> > a) Check group viability. A group is viable only when all devices in
> > the group are in one of below states:
> >
> > * driver-less
> > * bound to a driver which is same as dev->driver (vfio in this case)
> > * bound to an otherwise allowed driver (same list as in vfio)
>
> This really shouldn't use hardwired driver checks. Attached drivers
> should generically indicate to the iommu layer that they are safe for
> iommu_fd usage by calling some function around probe()

good idea.
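
To make the idea concrete, here is a minimal sketch, assuming a per-driver
opt-in flag that a driver sets around probe(). This is a userspace model,
not kernel code; all model_* names are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these are real kernel structures. */
struct model_driver {
    const char *name;
    bool iommu_fd_safe;                  /* set by the driver around probe() */
};

struct model_device {
    const struct model_driver *driver;   /* NULL when driver-less */
};

struct model_group {
    const struct model_device *devs;
    size_t ndevs;
};

/* A driver would call something like this around probe() to opt in. */
void model_driver_mark_iommu_fd_safe(struct model_driver *drv)
{
    assert(drv != NULL);
    drv->iommu_fd_safe = true;
}

/*
 * A group is viable when every device is either driver-less or bound to
 * a driver that declared itself safe. No driver names are hardwired.
 */
bool model_group_viable(const struct model_group *grp)
{
    assert(grp != NULL);
    for (size_t i = 0; i < grp->ndevs; i++) {
        const struct model_driver *drv = grp->devs[i].driver;

        if (drv != NULL && !drv->iommu_fd_safe)
            return false;
    }
    return true;
}
```

With this shape the viability check no longer needs a list of allowed
drivers; a driver that never opts in simply makes the group non-viable.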

>
> Thus a group must contain only iommu_fd safe drivers, or drivers-less
> devices before any of it can be used. It is the more general
> refactoring of what VFIO is doing.
>
> > c) The iommu layer also verifies group viability on BUS_NOTIFY_
> > BOUND_DRIVER event. BUG_ON if viability is broken while block_dma
> > is set.
>
> And with this concept of iommu_fd safety being first-class maybe we
> can somehow eliminate this gross BUG_ON (and the 100's of lines of
> code that are used to create it) by denying probe to non-iommu-safe
> drivers, somehow.

yes.

>
> > - Binding other devices in the group to iommu_fd just succeeds since
> > the group is already in block_dma.
>
> I think the rest of this more or less describes the device centric
> logic for multi-device groups we've already talked about. I don't
> think it benifits from having the group fd
>

Sure. Nothing in this new sketch uses a group fd in any iommu fd
API. I am just trying to elaborate a full sketch to sync the base.

Alex/Joerg, looking forward to your thoughts now.

Thanks
Kevin

2021-06-28 02:05:21

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

Hi, Jason,

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, June 25, 2021 10:36 PM
>
> The only thing that needs to be done to get the 1:1 step is to broadly
> define how the other two cases will work so we don't get into trouble
> and set some way to exclude the problematic cases from even getting to
> iommu_fd in the first place.
>
> For instance if we go ahead and create /dev/vfio/device nodes we could
> do this only if the group was 1:1, otherwise the group cdev has to be
> used, along with its API.
>

I may have misunderstood your position in my last reply.

The bottom line is that iommu fd uAPI and helper functions should be
device-centric (no group fd carried). This is what this sketch tries to
achieve.

However, I'm getting confused about your position on vfio uAPIs.

At some point I felt you were OK with keeping the vfio group fd:

"For others, I don't think this is *strictly* necessary, we can
probably still get to the device_fd using the group_fd and fit in
/dev/ioasid. It does make the rest of this more readable though."
https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/#m1b1d2b4d6413e3b32ba972a97c2c6a16bf491796

But you are also obviously against faking group for mdev.

Combining with the last paragraph above, are you actually suggesting
that 1:1 group (including mdev) should use a new device-centric vfio
uAPI (without group fd) while existing group-centric vfio uAPI is only
kept for 1:N group (with slight semantics change in my sketch to match
device-centric iommu fd API)?

If yes, some assumptions in this sketch might change. For example,
with a /dev/vfio/device node I'm not sure how the user can pass {iommu_fd,
device_cookie} to establish the security context when opening the node
(not via an indirect group ioctl). That implies we may have to allow
the user to open a device before it is put into a security context, so the
safeguard may have to be enabled on mmap() for 1:1 groups. This is a
different sequence from the existing group-centric model.

Anyway, I'd appreciate it if you could elaborate on your thoughts so we
can think about them further.

Thanks
Kevin

2021-06-28 06:47:44

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, June 25, 2021 10:36 PM
>
> On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:
>
> > - When receiving the binding call for the 1st device in a group, iommu_fd
> > calls iommu_group_set_block_dma(group, dev->driver) which does
> > several things:
>
> The whole problem here is trying to match this new world where we want
> devices to be in charge of their own IOMMU configuration and the old
> world where groups are in charge.
>
> Inserting the group fd and then calling a device-centric
> VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> necessary. We can always get the group back from the device at any
> point in the sequence do to a group wide operation.
>
> What I saw as the appeal of the sort of idea was to just completely
> leave all the difficult multi-device-group scenarios behind on the old
> group centric API and then we don't have to deal with them at all, or
> least not right away.
>
> I'd see some progression where iommu_fd only works with 1:1 groups at
> the start. Other scenarios continue with the old API.
>
> Then maybe groups where all devices use the same IOASID.
>
> Then 1:N groups if the source device is reliably identifiable, this
> requires iommu subystem work to attach domains to sub-group objects -
> not sure it is worthwhile.
>
> But at least we can talk about each step with well thought out patches
>
> The only thing that needs to be done to get the 1:1 step is to broadly
> define how the other two cases will work so we don't get into trouble
> and set some way to exclude the problematic cases from even getting to
> iommu_fd in the first place.
>
> For instance if we go ahead and create /dev/vfio/device nodes we could
> do this only if the group was 1:1, otherwise the group cdev has to be
> used, along with its API.
>

Thinking more along your direction, here is an updated sketch:

[Stage-1]

Multi-device groups (1:N) are handled by the existing vfio group fd and
vfio_iommu_type1 driver.

Singleton group (1:1) is handled via a new device-centric protocol:

1) /dev/vfio/device nodes are created for devices in singleton group
or devices w/o group (mdev)

2) user gets iommu_fd by open("/dev/iommu"). A default block_dma
domain is created per iommu_fd (or globally) with an empty I/O
page table.

3) iommu_fd reports that only 1:1 group is supported

4) user gets device_fd by open("/dev/vfio/device"). At this point
mmap() should be blocked since a security context hasn't been
established for this fd. This could be done by returning an error
(EACCES or EAGAIN?), or succeeding w/o actually setting up the
mapping.

5) user requests to bind device_fd to iommu_fd which verifies the
group is not 1:N (for mdev the check is on the parent device).
Successful binding automatically attaches the device to the block_
dma domain via iommu_attach_group(). From now on the user is
permitted to access the device. If mmap() in 4) is allowed, vfio
actually sets up the MMIO mapping at this point.

6) before the device is unbound from iommu_fd, it is always in a
security context. Attaching/detaching just switches the security
context between the block_dma domain and an ioasid domain.

7) Unbinding detaches the device from the block_dma domain and
re-attaches it to the default domain. From now on the user should
be denied from accessing the device. vfio should tear down the
MMIO mapping at this point.

[Stage-2]

Both 1:1 and 1:N groups are handled via the new device-centric protocol.
Old vfio uAPI is kept for legacy applications. All devices in the same group
must share the same I/O address space.

A key difference from stage-1 is the additional check on group viability:

1) vfio creates /dev/vfio/device nodes for all devices

2) Same as stage-1 for getting iommu_fd

3) iommu_fd reports that both 1:1 and 1:N groups are supported

4) Same as stage-1 for getting device_fd

5) when receiving the binding call for the 1st device in a group, iommu
fd does several things:

a) Identify the group of this device and check group viability. A group
is viable only when all devices in the group are in one of below states:

* driver-less
* bound to a driver which is same as the one which does the
binding call (vfio in this case)
* bound to an otherwise allowed driver (which indicates that it
is safe for iommu_fd usage around probe())

b) Attach all devices in the group to the block_dma domain, via existing
iommu_attach_group().

c) Register a notifier callback to verify group viability on IOMMU_GROUP_
NOTIFY_BOUND_DRIVER event. BUG_ON() might be eliminated if
we can find a way to deny probe of non-iommu-safe drivers.

From now on the user is permitted to access the device. Similar to
stage-1, vfio may set up the MMIO mapping at this point.

6) Binding other devices in the same group just succeeds

7) Before the last device in the group is unbound from iommu_fd, all
devices in the group (even those not bound to iommu_fd) switch together
between the block_dma domain and the ioasid domain, initiated by attaching
to or detaching from an ioasid.

a) iommu_fd verifies that all bound devices in the same group must be
attached to a single IOASID.

b) the 1st device attach in the group moves the entire group to use
the new IOASID domain.

c) the last device detach moves the entire group back to the block-dma
domain.

8) A device is allowed to be unbound from iommu_fd when other devices
in the group are still bound. In this case all devices in this group are still
attached to a security context (block-dma or ioasid). vfio may still zap
the mmio mapping (though still in security context) since it doesn't
know group in this new flow. The unbound device should not be bound
to another driver which could break the group viability.

9) When user requests to unbind the last device in the group, iommu_fd
detaches the whole group from the block-dma domain. All mmio mappings
must be zapped immediately. Devices in the group are re-attached to
the default domain from now on (not safe for user to access).

[Stage-3]

It's still an open question whether we want to further allow devices within a
group to be attached to different IOASIDs when the source devices are reliably
identifiable. This is a usage not supported by existing vfio and might not be
worthwhile, given that isolation is improving over time.

When it's required, iommu layer has to create sub-group objects and
expose the sub-group topology to userspace. In the meantime, iommu
API will be extended to allow sub-group attach/detach operations.

In this case, there is not much difference from the stage-2 flow. iommu_fd just
needs to understand the sub-group topology when allowing a group of
devices to be attached to different IOASIDs. The key is still to enforce that
the entire group is in iommu_fd managed security contexts (block-dma or
ioasid) as long as one or more devices in the group are still bound to it.

Thanks
Kevin

2021-06-28 15:07:21

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, Jun 28, 2021 at 02:03:56AM +0000, Tian, Kevin wrote:

> Combining with the last paragraph above, are you actually suggesting
> that 1:1 group (including mdev) should use a new device-centric vfio
> uAPI (without group fd) while existing group-centric vfio uAPI is only
> kept for 1:N group (with slight semantics change in my sketch to match
> device-centric iommu fd API)?

Yes, this is one approach

Using a VFIO_GROUP_GET_DEVICE_FD_NEW on the group FD is another
option, but locks us into having the group FD.

Which is better possibly depends on some details when going through
the code transformation, though I prefer not to design assuming the
group FD must exist.

> (not via an indirect group ioctl). Then it implies that we may have to allow
> the user open a device before it is put into a security context, thus the
> safe guard may have to be enabled on mmap() for 1:1 group. This is a
> different sequence from the existing group-centric model.

Yes, but I think this is fairly minor, it would just start with a
dummy fops and move to operational fops once things are setup enough.
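
A rough sketch of the dummy-fops idea, as a userspace model only. The
model_* names are hypothetical, the -EACCES return is a placeholder, and
a real implementation would need proper synchronization around the
pointer swap, which is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_file;

/* Hypothetical, reduced operations table with a single entry. */
struct model_fops {
    int (*mmap)(struct model_file *f);
};

struct model_file {
    const struct model_fops *fops;
};

/* Before binding: every operation that touches the device is refused. */
static int model_dummy_mmap(struct model_file *f)
{
    (void)f;
    return -EACCES;
}

/* After binding: the operation is allowed to proceed. */
static int model_real_mmap(struct model_file *f)
{
    (void)f;
    return 0;
}

static const struct model_fops model_dummy_fops = {
    .mmap = model_dummy_mmap,
};

static const struct model_fops model_operational_fops = {
    .mmap = model_real_mmap,
};

/* open() installs the dummy table. */
void model_open(struct model_file *f)
{
    assert(f != NULL);
    f->fops = &model_dummy_fops;
}

/* Called once the security context is established. */
void model_bind_done(struct model_file *f)
{
    assert(f != NULL);
    f->fops = &model_operational_fops;
}

int model_mmap(struct model_file *f)
{
    return f->fops->mmap(f);
}
```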

Jason

2021-06-28 23:44:21

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, 28 Jun 2021 01:09:18 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, June 25, 2021 10:36 PM
> >
> > On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:
> >
> > > - When receiving the binding call for the 1st device in a group, iommu_fd
> > > calls iommu_group_set_block_dma(group, dev->driver) which does
> > > several things:
> >
> > The whole problem here is trying to match this new world where we want
> > devices to be in charge of their own IOMMU configuration and the old
> > world where groups are in charge.
> >
> > Inserting the group fd and then calling a device-centric
> > VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> > necessary.
>
> No, this was not what I meant. There is no group fd required when
> calling this device-centric interface. I was actually talking about:
>
> iommu_group_set_block_dma(dev->group, dev->driver)
>
> just because current iommu layer API is group-centric. Whether this
> should be improved could be next-level thing. Sorry for not making
> it clear in the first place.
>
> > We can always get the group back from the device at any
> > point in the sequence do to a group wide operation.
>
> yes.
>
> >
> > What I saw as the appeal of the sort of idea was to just completely
> > leave all the difficult multi-device-group scenarios behind on the old
> > group centric API and then we don't have to deal with them at all, or
> > least not right away.
>
> yes, this is the staged approach that we discussed earlier. and
> the reason why I refined this proposal about multi-devices group
> here is because you want to see some confidence along this
> direction. Thus I expanded your idea and hope to achieve consensus
> with Alex/Joerg who obviously have not been convinced yet.
>
> >
> > I'd see some progression where iommu_fd only works with 1:1 groups at
> > the start. Other scenarios continue with the old API.
>
> One uAPI open after completing this new sketch. v1 proposed to
> conduct binding (VFIO_BIND_IOMMU_FD) after device_fd is acquired.
> With this sketch we need a new VFIO_GROUP_GET_DEVICE_FD_NEW
> to complete both in one step. I want to get Alex's confirmation whether
> it sounds good to him, since it's better to unify the uAPI between 1:1
> group and 1:N group even if we don't support 1:N in the start.

I don't like it. It doesn't make sense to me. You have the
group-centric world, which must continue to exist and cannot change
because we cannot break the vfio uapi. We can make extensions, we can
define a new parallel uapi, we can deprecate the uapi, but in the short
term, it can't change.

AIUI, the new device-centric model starts with vfio device files that
can be opened directly. So what then is the purpose of a *GROUP* get
device fd? Why is a vfio uapi involved in setting a device cookie for
another subsystem?

I'd expect that /dev/iommu will be used by multiple subsystems. All
will want to bind devices to address spaces, so shouldn't binding a
device to an iommufd be an ioctl on the iommufd, ie.
IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and
the iommufd code can figure it out internally.

You're essentially trying to reduce vfio to the device interface. That
necessarily implies that ioctls on the container, group, or passed
through the container to the iommu no longer exist. From my
perspective, there should ideally be no new vfio ioctls. The user gets
a limited access vfio device fd and enables full access to the device
by registering it to the iommufd subsystem (100% this needs to be
enforced until close() to avoid revoke issues). The user interacts
exclusively with vfio via the device fd and performs all DMA address
space related operations through the iommufd.

> > Then maybe groups where all devices use the same IOASID.
> >
> > Then 1:N groups if the source device is reliably identifiable, this
> > requires iommu subystem work to attach domains to sub-group objects -
> > not sure it is worthwhile.
> >
> > But at least we can talk about each step with well thought out patches
> >
> > The only thing that needs to be done to get the 1:1 step is to broadly
> > define how the other two cases will work so we don't get into trouble
> > and set some way to exclude the problematic cases from even getting to
> > iommu_fd in the first place.
> >
> > For instance if we go ahead and create /dev/vfio/device nodes we could
> > do this only if the group was 1:1, otherwise the group cdev has to be
> > used, along with its API.
>
> I feel for VFIO possibly we don't need significant change to its uAPI
> sequence, since it anyway needs to support existing semantics for
> backward compatibility. With this sketch we can keep vfio container/
> group by introducing an external iommu type which implies a different
> GET_DEVICE_FD semantics. /dev/iommu can report a fd-wide capability
> for whether 1:N group is supported to vfio user.

Ideally vfio would also at least be able to register a type1 IOMMU
backend through the existing uapi, backed by this iommu code, ie. we'd
create a new "iommufd" (but without the user visible fd), bind all the
group devices to it, generating our own device cookies, create a single
ioasid and attach all the devices to it (all internal). When using the
compatibility mode, userspace doesn't get device cookies, doesn't get
an iommufd, they do mappings through the container, where vfio owns the
cookies and ioasid.
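
As a sketch of that compatibility mode, assuming an internal context
object that plays the role of an iommufd without a user-visible fd. This
is a userspace model, not kernel code; every model_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 8

/* Hypothetical internal iommufd context with no user-visible fd. */
struct model_iommu_ctx {
    unsigned long next_cookie;
    int next_ioasid;
};

/* Hypothetical type1-compatible container owning cookies and the ioasid. */
struct model_container {
    struct model_iommu_ctx ctx;
    int ioasid;                              /* single ioasid for all devices */
    unsigned long cookies[MODEL_MAX_DEVS];   /* generated internally */
    size_t ndevs;
    size_t nmaps;                            /* mappings done via the container */
};

/* Creating the container creates the hidden context and one ioasid. */
void model_container_init(struct model_container *c)
{
    assert(c != NULL);
    c->ctx.next_cookie = 1;
    c->ctx.next_ioasid = 1;
    c->ioasid = c->ctx.next_ioasid++;
    c->ndevs = 0;
    c->nmaps = 0;
}

/* Binding a group device: the cookie never reaches userspace. */
int model_container_add_device(struct model_container *c)
{
    if (c->ndevs == MODEL_MAX_DEVS)
        return -ENOSPC;
    c->cookies[c->ndevs++] = c->ctx.next_cookie++;
    return 0;
}

/*
 * Legacy map path: the user names only the container, and the mapping
 * lands in the container's single ioasid. Returns the ioasid used.
 */
int model_container_map_dma(struct model_container *c)
{
    c->nmaps++;
    return c->ioasid;
}
```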

> For new subsystems they can directly create device nodes and rely on
> iommu fd to manage group isolation, without introducing any group
> semantics in its uAPI.

Create device nodes, bind them to iommufd, associate cookies, attach
ioasids, etc. That should be the same for all subsystems, including
vfio, it's just the magic internal handshake between the device
subsystem and the iommufd subsystem that changes.

> > > a) Check group viability. A group is viable only when all devices in
> > > the group are in one of below states:
> > >
> > > * driver-less
> > > * bound to a driver which is same as dev->driver (vfio in this case)
> > > * bound to an otherwise allowed driver (same list as in vfio)
> >
> > This really shouldn't use hardwired driver checks. Attached drivers
> > should generically indicate to the iommu layer that they are safe for
> > iommu_fd usage by calling some function around probe()
>
> good idea.
>
> >
> > Thus a group must contain only iommu_fd safe drivers, or drivers-less
> > devices before any of it can be used. It is the more general
> > refactoring of what VFIO is doing.
> >
> > > c) The iommu layer also verifies group viability on BUS_NOTIFY_
> > > BOUND_DRIVER event. BUG_ON if viability is broken while block_dma
> > > is set.
> >
> > And with this concept of iommu_fd safety being first-class maybe we
> > can somehow eliminate this gross BUG_ON (and the 100's of lines of
> > code that are used to create it) by denying probe to non-iommu-safe
> > drivers, somehow.
>
> yes.
>
> >
> > > - Binding other devices in the group to iommu_fd just succeeds since
> > > the group is already in block_dma.
> >
> > I think the rest of this more or less describes the device centric
> > logic for multi-device groups we've already talked about. I don't
> > think it benifits from having the group fd
> >
>
> sure. All of this new sketch doesn't have group fd in any iommu fd
> API. Just try to elaborate a full sketch to sync the base.
>
> Alex/Joerg, look forward to your thoughts now.

Some provided. Thanks,

Alex

2021-06-28 23:44:29

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote:

> I'd expect that /dev/iommu will be used by multiple subsystems. All
> will want to bind devices to address spaces, so shouldn't binding a
> device to an iommufd be an ioctl on the iommufd, ie.
> IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and
> the iommufd code can figure it out internally.

It wants to be the other way around because iommu_fd is the lower
level subsystem. We don't/can't teach iommu_fd how to convert a fd
number to a vfio/vdpa/etc/etc, we teach all the things building on
iommu_fd how to change a fd number to an iommu - they already
necessarily have an inter-module linkage.

There is a certain niceness to what you are saying but it is not so
practical without doing something bigger..

> Ideally vfio would also at least be able to register a type1 IOMMU
> backend through the existing uapi, backed by this iommu code, ie. we'd
> create a new "iommufd" (but without the user visible fd),

It would be amazing to be able to achieve this, but at least for me there
are too many details to be able to tell what that would look like
exactly. I suggested once that putting the container ioctl interface
in the drivers/iommu code may allow for this without too much trouble..

Jason

2021-06-28 23:56:21

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, Jun 28, 2021 at 06:45:23AM +0000, Tian, Kevin wrote:

> 7) Unbinding detaches the device from the block_dma domain and
> re-attach it to the default domain. From now on the user should
> be denied from accessing the device. vfio should tear down the
> MMIO mapping at this point.

I think we should just forbid this: so long as the device_fd is open,
the iommu_fd cannot be destroyed, and there is no way to detach a
device other than closing its fd.

revoke is tricky enough to implement we should avoid it.
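
One way to picture this is a reference held by each bound device fd, with
no unbind operation at all. The following is a minimal userspace model
under that assumption; the model_* names are hypothetical and this is not
kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the iommu fd lifetime; not kernel code. */
struct model_iommu_fd {
    int users;        /* bound device fds that are still open */
    bool closed;      /* the user has closed the iommu fd */
    bool destroyed;   /* the backing context has been torn down */
};

struct model_device_fd {
    struct model_iommu_fd *iommu;   /* NULL until bound */
};

/* Binding takes a reference. There is deliberately no model_unbind(). */
int model_bind(struct model_device_fd *dev, struct model_iommu_fd *iommu)
{
    assert(dev != NULL && iommu != NULL);
    if (dev->iommu != NULL)
        return -EBUSY;
    dev->iommu = iommu;
    iommu->users++;
    return 0;
}

/* Closing the iommu fd only destroys it when no device still uses it. */
void model_iommu_close(struct model_iommu_fd *iommu)
{
    iommu->closed = true;
    if (iommu->users == 0)
        iommu->destroyed = true;
}

/* Closing the device fd is the only way to drop the binding. */
void model_device_close(struct model_device_fd *dev)
{
    struct model_iommu_fd *iommu = dev->iommu;

    if (iommu == NULL)
        return;
    dev->iommu = NULL;
    if (--iommu->users == 0 && iommu->closed)
        iommu->destroyed = true;
}
```

Because the binding lasts until close(), nothing ever has to revoke
access from an fd that userspace still holds open.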

> It's still an open whether we want to further allow devices within a group
> attached to different IOASIDs in case that the source devices are reliably
> identifiable. This is an usage not supported by existing vfio and might be
> not worthwhile due to improved isolation over time.

The main decision here is whether the uAPI should have some way to
indicate that a device does not have its own unique IOASID but is
sharing one with the group.

Jason

2021-06-29 00:38:10

by Alex Williamson

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, 28 Jun 2021 19:48:18 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote:
>
> > I'd expect that /dev/iommu will be used by multiple subsystems. All
> > will want to bind devices to address spaces, so shouldn't binding a
> > device to an iommufd be an ioctl on the iommufd, ie.
> > IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and
> > the iommufd code can figure it out internally.
>
> It wants to be the other way around because iommu_fd is the lower
> level subsystem. We don't/can't teach iommu_fd how to convert a fd
> number to a vfio/vdpa/etc/etc, we teach all the things building on
> iommu_fd how to change a fd number to an iommu - they already
> necessarily have an inter-module linkage.

These seem like peer subsystems, like vfio and kvm. vfio shouldn't
have any hard dependencies on the iommufd module, especially so long as
we have the legacy type1 code. Likewise iommufd shouldn't have any on
vfio. As much as you dislike the symbol_get hack of the kvm-vfio
device, it would be reasonable for iommufd to reach for a vfio symbol
when an IOMMU_BIND_VFIO_DEVICE_FD ioctl is called.

> There is a certain niceness to what you are saying but it is not so
> practical without doing something bigger..
>
> > Ideally vfio would also at least be able to register a type1 IOMMU
> > backend through the existing uapi, backed by this iommu code, ie. we'd
> > create a new "iommufd" (but without the user visible fd),
>
> It would be amazing to be able to achieve this, at least for me there
> are too many details be able to tell what that would look like
> exactly. I suggested once that putting the container ioctl interface
> in the drivers/iommu code may allow for this without too much trouble..

If we can't achieve this, then type1 legacy code is going to need to
live through an extended deprecation period. I'm hoping that between
type1 and a native interface we'll have two paths into iommufd to vet
the design. Thanks,

Alex

2021-06-29 00:40:19

by Jason Gunthorpe

Subject: Re: Plan for /dev/ioasid RFC v2

On Mon, Jun 28, 2021 at 05:09:02PM -0600, Alex Williamson wrote:
> On Mon, 28 Jun 2021 19:48:18 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote:
> >
> > > I'd expect that /dev/iommu will be used by multiple subsystems. All
> > > will want to bind devices to address spaces, so shouldn't binding a
> > > device to an iommufd be an ioctl on the iommufd, ie.
> > > IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and
> > > the iommufd code can figure it out internally.
> >
> > It wants to be the other way around because iommu_fd is the lower
> > level subsystem. We don't/can't teach iommu_fd how to convert a fd
> > number to a vfio/vdpa/etc/etc, we teach all the things building on
> > iommu_fd how to change a fd number to an iommu - they already
> > necessarily have an inter-module linkage.
>
> These seem like peer subsystems, like vfio and kvm. vfio shouldn't
> have any hard dependencies on the iommufd module, especially so long as
> we have the legacy type1 code.

It does, the vfio_device implementation has to tell the iommu subsystem
what kind of device behavior it has and possibly interact with the
iommu subsystem with it in cases like PASID. This was outlined in part
of the RFC.

In any event a module dependency from vfio to iommu is not bothersome,
while the other way certainly is.

> Likewise iommufd shouldn't have any on vfio. As much as you
> dislike the symbol_get hack of the kvm-vfio device, it would be
> reasonable for iommufd to reach for a vfio symbol when an
> IOMMU_BIND_VFIO_DEVICE_FD ioctl is called.

We'd have to add a special ioctl to iommu for every new subsystem, it
doesn't scale. iommu is a core subsystem, vfio is a driver subsystem.
The direction of dependency is clear, I think.
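
The layering argued for here can be sketched as follows: the core
exports one generic bind entry point, and each subsystem handles its own
ioctl and calls down. This is a userspace model with hypothetical
model_* names (including the fixed-size fd table); it is not the actual
kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_FDS 4

/* "Core" side: knows nothing about vfio, vdpa, or any other user. */
struct model_iommu_ctx {
    int nbound;   /* devices bound to this context */
};

static struct model_iommu_ctx model_fd_table[MODEL_MAX_FDS];

/* Exported by the core: resolve an iommu fd number to its context. */
struct model_iommu_ctx *model_iommu_ctx_from_fd(int fd)
{
    if (fd < 0 || fd >= MODEL_MAX_FDS)
        return NULL;
    return &model_fd_table[fd];
}

/* Exported by the core: one generic bind entry point for all users. */
int model_iommu_bind(struct model_iommu_ctx *ctx, unsigned long cookie)
{
    assert(ctx != NULL);
    (void)cookie;   /* a real implementation would record the device label */
    ctx->nbound++;
    return 0;
}

/* "Driver subsystem" side: vfio handles its own ioctl and calls down. */
int model_vfio_bind_iommu_fd(int iommu_fd, unsigned long cookie)
{
    struct model_iommu_ctx *ctx = model_iommu_ctx_from_fd(iommu_fd);

    if (ctx == NULL)
        return -EBADF;
    return model_iommu_bind(ctx, cookie);
}

/* Another subsystem reuses the same core call; the core is unchanged. */
int model_vdpa_bind_iommu_fd(int iommu_fd, unsigned long cookie)
{
    struct model_iommu_ctx *ctx = model_iommu_ctx_from_fd(iommu_fd);

    if (ctx == NULL)
        return -EBADF;
    return model_iommu_bind(ctx, cookie);
}
```

Adding a third subsystem means adding one more wrapper on the subsystem
side, with no new ioctl or symbol lookup in the core.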

Jason

2021-06-29 00:42:58

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Jason Gunthorpe
> Sent: Tuesday, June 29, 2021 7:13 AM
>
> On Mon, Jun 28, 2021 at 05:09:02PM -0600, Alex Williamson wrote:
> > On Mon, 28 Jun 2021 19:48:18 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Mon, Jun 28, 2021 at 04:31:45PM -0600, Alex Williamson wrote:
> > >
> > > > I'd expect that /dev/iommu will be used by multiple subsystems. All
> > > > will want to bind devices to address spaces, so shouldn't binding a
> > > > device to an iommufd be an ioctl on the iommufd, ie.
> > > > > IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there and
> > > > > the iommufd code can figure it out internally.
> > >
> > > It wants to be the other way around because iommu_fd is the lower
> > > level subsystem. We don't/can't teach iommu_fd how to convert a fd
> > > number to a vfio/vdpa/etc/etc, we teach all the things building on
> > > iommu_fd how to change a fd number to an iommu - they already
> > > necessarily have an inter-module linkage.
> >
> > These seem like peer subsystems, like vfio and kvm. vfio shouldn't
> > have any hard dependencies on the iommufd module, especially so long as
> > we have the legacy type1 code.
>
> It does, the vfio_device implementation has to tell the iommu subsystem
> what kind of device behavior it has and possibly interact with the
> iommu subsystem with it in cases like PASID. This was outlined in part
> of the RFC.

Right. PASID is managed by the specific device driver in this RFC and provided
as routing information to iommu_fd when the device is attached to an
IOASID. Another point is about PASID virtualization (vPASID->pPASID),
which is established by having the user register its vPASID when doing
the attach call. The vfio device driver needs to use this information in the
mediation path. Conceptually, vPASID is not relevant to iommu_fd, which only
cares about pPASID. Having vPASID registered via the iommu_fd uAPI and
then indirectly communicated to the vfio device driver does not look like a clean
way in the first place.
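
For illustration, the vPASID->pPASID association that the vfio device
driver would consult in the mediation path could look like the table
below. This is a minimal userspace model; the model_* names, the fixed
table size, and the error codes are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_PASIDS 8

/* Hypothetical per-device table kept by the vfio device driver. */
struct model_pasid_map {
    unsigned int vpasid[MODEL_MAX_PASIDS];   /* guest-visible PASIDs */
    unsigned int ppasid[MODEL_MAX_PASIDS];   /* physical PASIDs */
    size_t n;
};

/* Record a vPASID->pPASID pair when the user does the attach call. */
int model_pasid_register(struct model_pasid_map *m,
                         unsigned int vpasid, unsigned int ppasid)
{
    assert(m != NULL);
    for (size_t i = 0; i < m->n; i++) {
        if (m->vpasid[i] == vpasid)
            return -EEXIST;
    }
    if (m->n == MODEL_MAX_PASIDS)
        return -ENOSPC;
    m->vpasid[m->n] = vpasid;
    m->ppasid[m->n] = ppasid;
    m->n++;
    return 0;
}

/* Mediation path: translate a guest PASID before it reaches hardware. */
int model_pasid_translate(const struct model_pasid_map *m,
                          unsigned int vpasid, unsigned int *ppasid)
{
    assert(m != NULL && ppasid != NULL);
    for (size_t i = 0; i < m->n; i++) {
        if (m->vpasid[i] == vpasid) {
            *ppasid = m->ppasid[i];
            return 0;
        }
    }
    return -ENOENT;
}
```

In this model only the pPASID would ever be passed on to iommu_fd; the
vPASID stays inside the device driver that needs it.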

Thanks
Kevin

2021-06-29 00:44:32

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Alex Williamson <[email protected]>
> Sent: Tuesday, June 29, 2021 6:32 AM
>
> On Mon, 28 Jun 2021 01:09:18 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Friday, June 25, 2021 10:36 PM
> > >
> > > On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:
> > >
> > > > > - When receiving the binding call for the 1st device in a group, iommu_fd
> > > > calls iommu_group_set_block_dma(group, dev->driver) which does
> > > > several things:
> > >
> > > The whole problem here is trying to match this new world where we want
> > > devices to be in charge of their own IOMMU configuration and the old
> > > world where groups are in charge.
> > >
> > > Inserting the group fd and then calling a device-centric
> > > VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> > > necessary.
> >
> > No, this was not what I meant. There is no group fd required when
> > calling this device-centric interface. I was actually talking about:
> >
> > iommu_group_set_block_dma(dev->group, dev->driver)
> >
> > just because current iommu layer API is group-centric. Whether this
> > should be improved could be next-level thing. Sorry for not making
> > it clear in the first place.
> >
> > > We can always get the group back from the device at any
> > > point in the sequence do to a group wide operation.
> >
> > yes.
> >
> > >
> > > What I saw as the appeal of the sort of idea was to just completely
> > > leave all the difficult multi-device-group scenarios behind on the old
> > > group centric API and then we don't have to deal with them at all, or
> > > at least not right away.
> >
> > yes, this is the staged approach that we discussed earlier. and
> > the reason why I refined this proposal about multi-devices group
> > here is because you want to see some confidence along this
> > direction. Thus I expanded your idea and hope to achieve consensus
> > with Alex/Joerg who obviously have not been convinced yet.
> >
> > >
> > > I'd see some progression where iommu_fd only works with 1:1 groups at
> > > the start. Other scenarios continue with the old API.
> >
> > One uAPI open after completing this new sketch. v1 proposed to
> > conduct binding (VFIO_BIND_IOMMU_FD) after device_fd is acquired.
> > With this sketch we need a new VFIO_GROUP_GET_DEVICE_FD_NEW
> > to complete both in one step. I want to get Alex's confirmation whether
> > it sounds good to him, since it's better to unify the uAPI between 1:1
> > group and 1:N group even if we don't support 1:N in the start.
>
> I don't like it. It doesn't make sense to me. You have the
> group-centric world, which must continue to exist and cannot change
> because we cannot break the vfio uapi. We can make extensions, we can
> define a new parallel uapi, we can deprecate the uapi, but in the short
> term, it can't change.
>
> AIUI, the new device-centric model starts with vfio device files that
> can be opened directly. So what then is the purpose of a *GROUP* get
> device fd? Why is a vfio uapi involved in setting a device cookie for
> another subsystem?
>
> I'd expect that /dev/iommu will be used by multiple subsystems. All
> will want to bind devices to address spaces, so shouldn't binding a
> device to an iommufd be an ioctl on the iommufd, ie.
> IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there
> and the iommufd code can figure it out internally.
>
> You're essentially trying to reduce vfio to the device interface. That
> necessarily implies that ioctls on the container, group, or passed
> through the container to the iommu no longer exist. From my
> perspective, there should ideally be no new vfio ioctls. The user gets
> a limited access vfio device fd and enables full access to the device
> by registering it to the iommufd subsystem (100% this needs to be
> enforced until close() to avoid revoke issues). The user interacts
> exclusively with vfio via the device fd and performs all DMA address
> space related operations through the iommufd.
>
> > > Then maybe groups where all devices use the same IOASID.
> > >
> > > Then 1:N groups if the source device is reliably identifiable, this
> > > requires iommu subsystem work to attach domains to sub-group objects -
> > > not sure it is worthwhile.
> > >
> > > But at least we can talk about each step with well thought out patches
> > >
> > > The only thing that needs to be done to get the 1:1 step is to broadly
> > > define how the other two cases will work so we don't get into trouble
> > > and set some way to exclude the problematic cases from even getting to
> > > iommu_fd in the first place.
> > >
> > > For instance if we go ahead and create /dev/vfio/device nodes we could
> > > do this only if the group was 1:1, otherwise the group cdev has to be
> > > used, along with its API.
> >
> > I feel for VFIO possibly we don't need significant change to its uAPI
> > sequence, since it anyway needs to support existing semantics for
> > backward compatibility. With this sketch we can keep vfio container/
> > group by introducing an external iommu type which implies a different
> > GET_DEVICE_FD semantics. /dev/iommu can report a fd-wide capability
> > for whether 1:N group is supported to vfio user.
>
> Ideally vfio would also at least be able to register a type1 IOMMU
> backend through the existing uapi, backed by this iommu code, ie. we'd
> create a new "iommufd" (but without the user visible fd), bind all the
> group devices to it, generating our own device cookies, create a single
> ioasid and attach all the devices to it (all internal). When using the
> compatibility mode, userspace doesn't get device cookies, doesn't get
> an iommufd, they do mappings through the container, where vfio owns the
> cookies and ioasid.
>
> > For new subsystems they can directly create device nodes and rely on
> > iommu fd to manage group isolation, without introducing any group
> > semantics in its uAPI.
>
> Create device nodes, bind them to iommufd, associate cookies, attach
> ioasids, etc. That should be the same for all subsystems, including
> vfio, it's just the magic internal handshake between the device
> subsystem and the iommufd subsystem that changes.
>
> > > > a) Check group viability. A group is viable only when all devices in
> > > > the group are in one of below states:
> > > >
> > > > * driver-less
> > > > * bound to a driver which is same as dev->driver (vfio in this
> > > > case)
> > > > * bound to an otherwise allowed driver (same list as in vfio)
> > >
> > > This really shouldn't use hardwired driver checks. Attached drivers
> > > should generically indicate to the iommu layer that they are safe for
> > > iommu_fd usage by calling some function around probe()
> >
> > good idea.
> >
> > >
> > > Thus a group must contain only iommu_fd safe drivers, or drivers-less
> > > devices before any of it can be used. It is the more general
> > > refactoring of what VFIO is doing.
> > >
> > > > c) The iommu layer also verifies group viability on
> > > > BUS_NOTIFY_BOUND_DRIVER event. BUG_ON if viability is broken while
> > > > block_dma is set.
> > >
> > > And with this concept of iommu_fd safety being first-class maybe we
> > > can somehow eliminate this gross BUG_ON (and the 100's of lines of
> > > code that are used to create it) by denying probe to non-iommu-safe
> > > drivers, somehow.
> >
> > yes.
> >
> > >
> > > > - Binding other devices in the group to iommu_fd just succeeds since
> > > > the group is already in block_dma.
> > >
> > > I think the rest of this more or less describes the device centric
> > > logic for multi-device groups we've already talked about. I don't
> > > think it benefits from having the group fd
> > >
> >
> > sure. All of this new sketch doesn't have group fd in any iommu fd
> > API. Just try to elaborate a full sketch to sync the base.
> >
> > Alex/Joerg, look forward to your thoughts now.
>
> Some provided. Thanks,
>

Thanks a lot Alex! We'll try to focus on the new device-centric flow
w/o touching the existing container/group uAPI. As you said, we need
a brand-new mechanism for all subsystems anyway.

With that I will resume v2 progress based on the device-centric concept.
It will still be based on a few new VFIO uAPIs to handle device
binding/attaching, though you prefer not to add any new VFIO uAPI. This
is a relatively small open compared to the device-centric vs.
group-centric issue. We can continue discussing it in parallel with the
v2 review, and I hope v2 can help close this open with a clearer
explanation of PASID virtualization.
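
For reference, the first-bind behavior from my earlier sketch (viability
check, then block_dma, then later binds in the same group just succeed) can
be modeled roughly as below. This is a userspace illustration only: the
struct and function names are made up, the allow-list entry is just an
example, and the hardwired driver comparison is exactly what Jason suggested
replacing with a generic "iommu_fd safe" indication from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a device: driver == NULL means driver-less. */
struct sketch_device {
	const char *driver;
	bool bound;               /* bound to iommu_fd */
};

/* Hypothetical model of an iommu group. */
struct sketch_group {
	struct sketch_device *devs;
	size_t ndevs;
	bool block_dma;
};

/* Illustrative allow-list; the real list would match what vfio uses. */
static const char *const allowed_drivers[] = { "pci-stub" };

static bool driver_allowed(const char *name)
{
	size_t n = sizeof(allowed_drivers) / sizeof(allowed_drivers[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(name, allowed_drivers[i]) == 0)
			return true;
	return false;
}

/*
 * A group is viable only if every device is driver-less, bound to the
 * same driver as the device being bound, or bound to an allowed driver.
 */
static bool group_viable(const struct sketch_group *g, const char *drv)
{
	for (size_t i = 0; i < g->ndevs; i++) {
		const char *d = g->devs[i].driver;

		if (d == NULL || strcmp(d, drv) == 0 || driver_allowed(d))
			continue;
		return false;
	}
	return true;
}

/*
 * Bind device idx to iommu_fd. The first bind in a group checks
 * viability and sets block_dma; later binds skip the check.
 * Returns 0 on success, -1 if the group is not viable.
 */
static int sketch_bind(struct sketch_group *g, size_t idx)
{
	assert(idx < g->ndevs && g->devs[idx].driver != NULL);

	if (!g->block_dma) {
		if (!group_viable(g, g->devs[idx].driver))
			return -1;
		g->block_dma = true;
	}
	g->devs[idx].bound = true;
	return 0;
}
```

Note that nothing here takes a group fd. The group is only reached through
the device, which is the point of the device-centric flow.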

Thanks
Kevin

2021-06-29 00:44:52

by Tian, Kevin

Subject: RE: Plan for /dev/ioasid RFC v2

> From: Alex Williamson <[email protected]>
> Sent: Tuesday, June 29, 2021 7:09 AM
>
> > > Ideally vfio would also at least be able to register a type1 IOMMU
> > > backend through the existing uapi, backed by this iommu code, ie. we'd
> > > create a new "iommufd" (but without the user visible fd),
> >
> > It would be amazing to be able to achieve this, at least for me there
> > are too many details to be able to tell what that would look like
> > exactly. I suggested once that putting the container ioctl interface
> > in the drivers/iommu code may allow for this without too much trouble..
>
> If we can't achieve this, then type1 legacy code is going to need to
> live through an extended deprecation period. I'm hoping that between
> type1 and a native interface we'll have two paths into iommufd to vet
> the design. Thanks,
>

I prefer keeping the type1 legacy code for some time. After the new
device-centric flow works, we can then study how to make type1
a shim layer on top of it.

Thanks
Kevin