2022-02-07 06:37:11

by Matthew Rosato

Subject: [PATCH v3 00/30] KVM: s390: enable zPCI for interpretive execution

Enable interpretive execution of zPCI instructions + adapter interruption
forwarding for s390x KVM vfio-pci. This is done by introducing a series
of new vfio-pci feature ioctls that are unique to vfio-pci-zdev (s390x)
and are used to negotiate the various aspects of zPCI interpretation
setup. By allowing interpretation of zPCI instructions and firmware
delivery of interrupts to guests, we can significantly reduce the
frequency of guest SIE exits for zPCI. We then see additional gains by
handling a hot-path instruction that can still intercept to the
hypervisor (RPCIT) directly in KVM.

From the perspective of guest configuration, you pass through zPCI
devices in the same manner as before, with interpretation support used
by default when it is available in both the kernel and QEMU.

Will reply with a link to the associated QEMU series.

Changelog v2->v3:
- More R-bs / ACKs (Thanks!)
- Re-word patch 6 commit message (Claudio)
- Patch 8 + some later patches: s/gd/gisa/ (Pierre)
- Patch 12: remove !mdd check (Pierre)
- some more virt/phys conversions (Pierre)
- Patch 18: check more sclp bits & facilities during interp probe (Pierre)
- Patch 21: fix fabricated status for some RPCIT intercept errors
- Patch 25-27: remove get/set checks from feature ioctl handlers as they
are already done in vfio core (Pierre)
- Patch 26: s/aif/aif_float/ and s/fhost/aif_fhost/
- remove kvm_s390_pci_attach_kvm and just do the work inline (Pierre)
- Use CONFIG_VFIO_PCI_ZDEV instead of CONFIG_PCI in Makefile and other
code locations (Pierre)
- Due to the above, re-arrange series order so CONFIG_VFIO_PCI_ZDEV is
introduced earlier
- s/aift->lock/aift->aift_lock/ (Pierre)
- Break some AEN init code into local functions zpci_setup_aipb() and
zpci_reset_aipb() (Pierre)
- check for errors on kvm_s390_gisc_register (Pierre)
- handle airq clear errors differently when we know the device is being
removed vs any other reason aif is being disabled (Pierre)
- s/ioat->lock/ioat->ioat_lock/ (Pierre)
- Fix backout case in kvm_s390_pci_ioat_enable, re-arrange rc settings
slightly (Pierre)
- Add a CONFIG_VFIO_PCI_ZDEV check when determining if it's safe to allow
KVM_S390_VM_CPU_FEAT_ZPCI_INTERP (need both the facilities and the
kvm/pci.o pieces to allow interpretation)

Matthew Rosato (30):
s390/sclp: detect the zPCI load/store interpretation facility
s390/sclp: detect the AISII facility
s390/sclp: detect the AENI facility
s390/sclp: detect the AISI facility
s390/airq: pass more TPI info to airq handlers
s390/airq: allow for airq structure that uses an input vector
s390/pci: externalize the SIC operation controls and routine
s390/pci: stash associated GISA designation
s390/pci: export some routines related to RPCIT processing
s390/pci: stash dtsm and maxstbl
s390/pci: add helper function to find device by handle
s390/pci: get SHM information from list pci
s390/pci: return status from zpci_refresh_trans
vfio/pci: re-introduce CONFIG_VFIO_PCI_ZDEV
KVM: s390: pci: add basic kvm_zdev structure
KVM: s390: pci: do initial setup for AEN interpretation
KVM: s390: pci: enable host forwarding of Adapter Event Notifications
KVM: s390: mechanism to enable guest zPCI Interpretation
KVM: s390: pci: provide routines for enabling/disabling interpretation
KVM: s390: pci: provide routines for enabling/disabling interrupt
forwarding
KVM: s390: pci: provide routines for enabling/disabling IOAT assist
KVM: s390: pci: handle refresh of PCI translations
KVM: s390: intercept the rpcit instruction
vfio-pci/zdev: wire up group notifier
vfio-pci/zdev: wire up zPCI interpretive execution support
vfio-pci/zdev: wire up zPCI adapter interrupt forwarding support
vfio-pci/zdev: wire up zPCI IOAT assist support
vfio-pci/zdev: add DTSM to clp group capability
KVM: s390: introduce CPU feature for zPCI Interpretation
MAINTAINERS: additional files related kvm s390 pci passthrough

MAINTAINERS | 2 +
arch/s390/include/asm/airq.h | 7 +-
arch/s390/include/asm/kvm_host.h | 5 +
arch/s390/include/asm/kvm_pci.h | 60 +++
arch/s390/include/asm/pci.h | 12 +
arch/s390/include/asm/pci_clp.h | 11 +-
arch/s390/include/asm/pci_dma.h | 3 +
arch/s390/include/asm/pci_insn.h | 31 +-
arch/s390/include/asm/sclp.h | 4 +
arch/s390/include/asm/tpi.h | 13 +
arch/s390/include/uapi/asm/kvm.h | 1 +
arch/s390/kvm/Makefile | 1 +
arch/s390/kvm/interrupt.c | 95 +++-
arch/s390/kvm/kvm-s390.c | 57 ++-
arch/s390/kvm/kvm-s390.h | 10 +
arch/s390/kvm/pci.c | 850 +++++++++++++++++++++++++++++++
arch/s390/kvm/pci.h | 60 +++
arch/s390/kvm/priv.c | 49 ++
arch/s390/pci/pci.c | 31 ++
arch/s390/pci/pci_clp.c | 28 +-
arch/s390/pci/pci_dma.c | 7 +-
arch/s390/pci/pci_insn.c | 15 +-
arch/s390/pci/pci_irq.c | 48 +-
drivers/iommu/s390-iommu.c | 4 +-
drivers/s390/char/sclp_early.c | 4 +
drivers/s390/cio/airq.c | 12 +-
drivers/s390/cio/qdio_thinint.c | 6 +-
drivers/s390/crypto/ap_bus.c | 9 +-
drivers/s390/virtio/virtio_ccw.c | 6 +-
drivers/vfio/pci/Kconfig | 11 +
drivers/vfio/pci/Makefile | 2 +-
drivers/vfio/pci/vfio_pci_core.c | 8 +
drivers/vfio/pci/vfio_pci_zdev.c | 275 +++++++++-
include/linux/vfio_pci_core.h | 42 +-
include/uapi/linux/vfio.h | 22 +
include/uapi/linux/vfio_zdev.h | 51 ++
36 files changed, 1787 insertions(+), 65 deletions(-)
create mode 100644 arch/s390/include/asm/kvm_pci.h
create mode 100644 arch/s390/kvm/pci.c
create mode 100644 arch/s390/kvm/pci.h

--
2.27.0



2022-02-07 07:49:57

by Matthew Rosato

Subject: [PATCH v3 13/30] s390/pci: return status from zpci_refresh_trans

Current callers of zpci_refresh_trans don't need to interrogate the status
returned from the underlying instructions. However, a subsequent patch
will add a KVM caller that needs this information. Add a new argument to
zpci_refresh_trans to pass the address of a status byte and update
existing call sites to provide it.
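
To illustrate the kind of caller this enables, here is a sketch (the
helper name is made up and not part of this series) of a KVM-side user
that surfaces the architected status byte instead of collapsing it into
an errno:

/* Sketch only; not part of this patch. */
static int refresh_and_fetch_status(struct zpci_dev *zdev, u64 dma_addr,
                                    u64 size, u8 *status)
{
        int rc;

        *status = 0;
        rc = zpci_refresh_trans((u64)zdev->fh << 32, dma_addr,
                                PAGE_ALIGN(size), status);

        /* On failure, *status carries the RPCIT status for the guest. */
        return rc;
}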

Reviewed-by: Pierre Morel <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Niklas Schnelle <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/pci_insn.h | 2 +-
arch/s390/pci/pci_dma.c | 6 ++++--
arch/s390/pci/pci_insn.c | 10 +++++-----
drivers/iommu/s390-iommu.c | 4 +++-
4 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/pci_insn.h b/arch/s390/include/asm/pci_insn.h
index 5331082fa516..32759c407b8f 100644
--- a/arch/s390/include/asm/pci_insn.h
+++ b/arch/s390/include/asm/pci_insn.h
@@ -135,7 +135,7 @@ union zpci_sic_iib {
DECLARE_STATIC_KEY_FALSE(have_mio);

u8 zpci_mod_fc(u64 req, struct zpci_fib *fib, u8 *status);
-int zpci_refresh_trans(u64 fn, u64 addr, u64 range);
+int zpci_refresh_trans(u64 fn, u64 addr, u64 range, u8 *status);
int __zpci_load(u64 *data, u64 req, u64 offset);
int zpci_load(u64 *data, const volatile void __iomem *addr, unsigned long len);
int __zpci_store(u64 data, u64 req, u64 offset);
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index a81de48d5ea7..b0a2380bcad8 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -23,8 +23,9 @@ static u32 s390_iommu_aperture_factor = 1;

static int zpci_refresh_global(struct zpci_dev *zdev)
{
+ u8 status;
return zpci_refresh_trans((u64) zdev->fh << 32, zdev->start_dma,
- zdev->iommu_pages * PAGE_SIZE);
+ zdev->iommu_pages * PAGE_SIZE, &status);
}

unsigned long *dma_alloc_cpu_table(void)
@@ -183,6 +184,7 @@ static int __dma_purge_tlb(struct zpci_dev *zdev, dma_addr_t dma_addr,
size_t size, int flags)
{
unsigned long irqflags;
+ u8 status;
int ret;

/*
@@ -201,7 +203,7 @@ static int __dma_purge_tlb(struct zpci_dev *zdev, dma_addr_t dma_addr,
}

ret = zpci_refresh_trans((u64) zdev->fh << 32, dma_addr,
- PAGE_ALIGN(size));
+ PAGE_ALIGN(size), &status);
if (ret == -ENOMEM && !s390_iommu_strict) {
/* enable the hypervisor to free some resources */
if (zpci_refresh_global(zdev))
diff --git a/arch/s390/pci/pci_insn.c b/arch/s390/pci/pci_insn.c
index 0509554301c7..ca6399d52767 100644
--- a/arch/s390/pci/pci_insn.c
+++ b/arch/s390/pci/pci_insn.c
@@ -77,20 +77,20 @@ static inline u8 __rpcit(u64 fn, u64 addr, u64 range, u8 *status)
return cc;
}

-int zpci_refresh_trans(u64 fn, u64 addr, u64 range)
+int zpci_refresh_trans(u64 fn, u64 addr, u64 range, u8 *status)
{
- u8 cc, status;
+ u8 cc;

do {
- cc = __rpcit(fn, addr, range, &status);
+ cc = __rpcit(fn, addr, range, status);
if (cc == 2)
udelay(ZPCI_INSN_BUSY_DELAY);
} while (cc == 2);

if (cc)
- zpci_err_insn(cc, status, addr, range);
+ zpci_err_insn(cc, *status, addr, range);

- if (cc == 1 && (status == 4 || status == 16))
+ if (cc == 1 && (*status == 4 || *status == 16))
return -ENOMEM;

return (cc) ? -EIO : 0;
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 50860ebdd087..845bb99c183e 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -214,6 +214,7 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
unsigned long irq_flags, nr_pages, i;
unsigned long *entry;
int rc = 0;
+ u8 status;

if (dma_addr < s390_domain->domain.geometry.aperture_start ||
dma_addr + size > s390_domain->domain.geometry.aperture_end)
@@ -238,7 +239,8 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
spin_lock(&s390_domain->list_lock);
list_for_each_entry(domain_device, &s390_domain->devices, list) {
rc = zpci_refresh_trans((u64) domain_device->zdev->fh << 32,
- start_dma_addr, nr_pages * PAGE_SIZE);
+ start_dma_addr, nr_pages * PAGE_SIZE,
+ &status);
if (rc)
break;
}
--
2.27.0


2022-02-07 10:52:17

by Matthew Rosato

Subject: [PATCH v3 06/30] s390/airq: allow for airq structure that uses an input vector

When doing device passthrough where interrupts are being forwarded from
host to guest, we wish to use a pinned section of guest memory as the
vector (the same memory used by the guest as the vector). To accomplish
this, add a new parameter for airq_iv_create which allows passing an
existing vector to be used instead of allocating a new one. The caller
is responsible for ensuring the vector is pinned in memory as well as for
unpinning the memory when the vector is no longer needed.

A subsequent patch will use this new parameter for zPCI interpretation.
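
For illustration only, a caller that supplies its own pinned vector could
wrap the new argument like this (the helper below is a sketch and not
part of this patch):

/* Sketch only; not part of this patch. */
static struct airq_iv *create_iv_from_guest(unsigned long bits,
                                            unsigned long *pinned_guest_vec)
{
        /*
         * pinned_guest_vec points into guest memory that the caller has
         * already pinned; airq_iv_release() will not free it, so the
         * caller must unpin it once the iv is gone.
         */
        return airq_iv_create(bits, AIRQ_IV_GUESTVEC, pinned_guest_vec);
}

Existing callers simply pass NULL for the new argument, as the updated
call sites below show.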

Reviewed-by: Pierre Morel <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/airq.h | 4 +++-
arch/s390/pci/pci_irq.c | 8 ++++----
drivers/s390/cio/airq.c | 10 +++++++---
drivers/s390/virtio/virtio_ccw.c | 2 +-
4 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/airq.h b/arch/s390/include/asm/airq.h
index 7918a7d09028..e82e5626e139 100644
--- a/arch/s390/include/asm/airq.h
+++ b/arch/s390/include/asm/airq.h
@@ -47,8 +47,10 @@ struct airq_iv {
#define AIRQ_IV_PTR 4 /* Allocate the ptr array */
#define AIRQ_IV_DATA 8 /* Allocate the data array */
#define AIRQ_IV_CACHELINE 16 /* Cacheline alignment for the vector */
+#define AIRQ_IV_GUESTVEC 32 /* Vector is a pinned guest page */

-struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags);
+struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags,
+ unsigned long *vec);
void airq_iv_release(struct airq_iv *iv);
unsigned long airq_iv_alloc(struct airq_iv *iv, unsigned long num);
void airq_iv_free(struct airq_iv *iv, unsigned long bit, unsigned long num);
diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
index cc4c8d7c8f5c..0d0a02a9fbbf 100644
--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -296,7 +296,7 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
zdev->aisb = bit;

/* Create adapter interrupt vector */
- zdev->aibv = airq_iv_create(msi_vecs, AIRQ_IV_DATA | AIRQ_IV_BITLOCK);
+ zdev->aibv = airq_iv_create(msi_vecs, AIRQ_IV_DATA | AIRQ_IV_BITLOCK, NULL);
if (!zdev->aibv)
return -ENOMEM;

@@ -419,7 +419,7 @@ static int __init zpci_directed_irq_init(void)
union zpci_sic_iib iib = {{0}};
unsigned int cpu;

- zpci_sbv = airq_iv_create(num_possible_cpus(), 0);
+ zpci_sbv = airq_iv_create(num_possible_cpus(), 0, NULL);
if (!zpci_sbv)
return -ENOMEM;

@@ -441,7 +441,7 @@ static int __init zpci_directed_irq_init(void)
zpci_ibv[cpu] = airq_iv_create(cache_line_size() * BITS_PER_BYTE,
AIRQ_IV_DATA |
AIRQ_IV_CACHELINE |
- (!cpu ? AIRQ_IV_ALLOC : 0));
+ (!cpu ? AIRQ_IV_ALLOC : 0), NULL);
if (!zpci_ibv[cpu])
return -ENOMEM;
}
@@ -458,7 +458,7 @@ static int __init zpci_floating_irq_init(void)
if (!zpci_ibv)
return -ENOMEM;

- zpci_sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC);
+ zpci_sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC, NULL);
if (!zpci_sbv)
goto out_free;

diff --git a/drivers/s390/cio/airq.c b/drivers/s390/cio/airq.c
index 2f2226786319..375a58b1c838 100644
--- a/drivers/s390/cio/airq.c
+++ b/drivers/s390/cio/airq.c
@@ -122,10 +122,12 @@ static inline unsigned long iv_size(unsigned long bits)
* airq_iv_create - create an interrupt vector
* @bits: number of bits in the interrupt vector
* @flags: allocation flags
+ * @vec: pointer to pinned guest memory if AIRQ_IV_GUESTVEC
*
* Returns a pointer to an interrupt vector structure
*/
-struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
+struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags,
+ unsigned long *vec)
{
struct airq_iv *iv;
unsigned long size;
@@ -146,6 +148,8 @@ struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
&iv->vector_dma);
if (!iv->vector)
goto out_free;
+ } else if (flags & AIRQ_IV_GUESTVEC) {
+ iv->vector = vec;
} else {
iv->vector = cio_dma_zalloc(size);
if (!iv->vector)
@@ -185,7 +189,7 @@ struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
kfree(iv->avail);
if (iv->flags & AIRQ_IV_CACHELINE && iv->vector)
dma_pool_free(airq_iv_cache, iv->vector, iv->vector_dma);
- else
+ else if (!(iv->flags & AIRQ_IV_GUESTVEC))
cio_dma_free(iv->vector, size);
kfree(iv);
out:
@@ -204,7 +208,7 @@ void airq_iv_release(struct airq_iv *iv)
kfree(iv->bitlock);
if (iv->flags & AIRQ_IV_CACHELINE)
dma_pool_free(airq_iv_cache, iv->vector, iv->vector_dma);
- else
+ else if (!(iv->flags & AIRQ_IV_GUESTVEC))
cio_dma_free(iv->vector, iv_size(iv->bits));
kfree(iv->avail);
kfree(iv);
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 52c376d15978..410498d693f8 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -241,7 +241,7 @@ static struct airq_info *new_airq_info(int index)
return NULL;
rwlock_init(&info->lock);
info->aiv = airq_iv_create(VIRTIO_IV_BITS, AIRQ_IV_ALLOC | AIRQ_IV_PTR
- | AIRQ_IV_CACHELINE);
+ | AIRQ_IV_CACHELINE, NULL);
if (!info->aiv) {
kfree(info);
return NULL;
--
2.27.0


2022-02-07 11:29:03

by Matthew Rosato

Subject: [PATCH v3 12/30] s390/pci: get SHM information from list pci

KVM will need information on the special handle mask used to indicate
emulated devices. In order to obtain this, a new type of list pci call
must be made to gather the information. Extend clp_list_pci_req to
also fetch the model-dependent-data field that holds this mask.
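
A sketch of the intended consumer (the real user arrives later in the
series; the helper below is illustrative only):

/* Sketch only; not part of this patch. */
static bool fh_is_emulated(u32 fh)
{
        u32 mdd = 0;

        if (zpci_get_mdd(&mdd) || !mdd)
                return false;

        /* A handle with any SHM-mask bit set denotes an emulated device. */
        return (fh & mdd) != 0;
}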

Reviewed-by: Niklas Schnelle <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/pci.h | 1 +
arch/s390/include/asm/pci_clp.h | 2 +-
arch/s390/pci/pci_clp.c | 25 ++++++++++++++++++++++---
3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 3c0b9986dcdc..e8a3fd5bc169 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -227,6 +227,7 @@ int clp_enable_fh(struct zpci_dev *zdev, u32 *fh, u8 nr_dma_as);
int clp_disable_fh(struct zpci_dev *zdev, u32 *fh);
int clp_get_state(u32 fid, enum zpci_state *state);
int clp_refresh_fh(u32 fid, u32 *fh);
+int zpci_get_mdd(u32 *mdd);

/* UID */
void update_uid_checking(bool new);
diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
index d6189ed14f84..dc2041e97de4 100644
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -76,7 +76,7 @@ struct clp_req_list_pci {
struct clp_rsp_list_pci {
struct clp_rsp_hdr hdr;
u64 resume_token;
- u32 reserved2;
+ u32 mdd;
u16 max_fn;
u8 : 7;
u8 uid_checking : 1;
diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index dc733b58e74f..7477956be632 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -328,7 +328,7 @@ int clp_disable_fh(struct zpci_dev *zdev, u32 *fh)
}

static int clp_list_pci_req(struct clp_req_rsp_list_pci *rrb,
- u64 *resume_token, int *nentries)
+ u64 *resume_token, int *nentries, u32 *mdd)
{
int rc;

@@ -354,6 +354,8 @@ static int clp_list_pci_req(struct clp_req_rsp_list_pci *rrb,
*nentries = (rrb->response.hdr.len - LIST_PCI_HDR_LEN) /
rrb->response.entry_size;
*resume_token = rrb->response.resume_token;
+ if (mdd)
+ *mdd = rrb->response.mdd;

return rc;
}
@@ -365,7 +367,7 @@ static int clp_list_pci(struct clp_req_rsp_list_pci *rrb, void *data,
int nentries, i, rc;

do {
- rc = clp_list_pci_req(rrb, &resume_token, &nentries);
+ rc = clp_list_pci_req(rrb, &resume_token, &nentries, NULL);
if (rc)
return rc;
for (i = 0; i < nentries; i++)
@@ -383,7 +385,7 @@ static int clp_find_pci(struct clp_req_rsp_list_pci *rrb, u32 fid,
int nentries, i, rc;

do {
- rc = clp_list_pci_req(rrb, &resume_token, &nentries);
+ rc = clp_list_pci_req(rrb, &resume_token, &nentries, NULL);
if (rc)
return rc;
fh_list = rrb->response.fh_list;
@@ -468,6 +470,23 @@ int clp_get_state(u32 fid, enum zpci_state *state)
return rc;
}

+int zpci_get_mdd(u32 *mdd)
+{
+ struct clp_req_rsp_list_pci *rrb;
+ u64 resume_token = 0;
+ int nentries, rc;
+
+ rrb = clp_alloc_block(GFP_KERNEL);
+ if (!rrb)
+ return -ENOMEM;
+
+ rc = clp_list_pci_req(rrb, &resume_token, &nentries, mdd);
+
+ clp_free_block(rrb);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(zpci_get_mdd);
+
static int clp_base_slpc(struct clp_req *req, struct clp_req_rsp_slpc *lpcb)
{
unsigned long limit = PAGE_SIZE - sizeof(lpcb->request);
--
2.27.0


2022-02-07 11:29:11

by Matthew Rosato

Subject: [PATCH v3 16/30] KVM: s390: pci: do initial setup for AEN interpretation

Initial setup for Adapter Event Notification Interpretation for zPCI
passthrough devices. Specifically, allocate a structure for forwarding of
adapter events and pass the address of this structure to firmware.

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/pci.h | 4 +
arch/s390/include/asm/pci_insn.h | 12 +++
arch/s390/kvm/interrupt.c | 14 +++
arch/s390/kvm/kvm-s390.c | 9 ++
arch/s390/kvm/pci.c | 154 +++++++++++++++++++++++++++++++
arch/s390/kvm/pci.h | 42 +++++++++
arch/s390/pci/pci.c | 6 ++
7 files changed, 241 insertions(+)
create mode 100644 arch/s390/kvm/pci.h

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 4faff673078b..1ae49330d1c8 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -9,6 +9,7 @@
#include <asm-generic/pci.h>
#include <asm/pci_clp.h>
#include <asm/pci_debug.h>
+#include <asm/pci_insn.h>
#include <asm/sclp.h>

#define PCIBIOS_MIN_IO 0x1000
@@ -204,6 +205,9 @@ extern const struct attribute_group *zpci_attr_groups[];
extern unsigned int s390_pci_force_floating __initdata;
extern unsigned int s390_pci_no_rid;

+extern union zpci_sic_iib *zpci_aipb;
+extern struct airq_iv *zpci_aif_sbv;
+
/* -----------------------------------------------------------------------------
Prototypes
----------------------------------------------------------------------------- */
diff --git a/arch/s390/include/asm/pci_insn.h b/arch/s390/include/asm/pci_insn.h
index 32759c407b8f..ad9000295c82 100644
--- a/arch/s390/include/asm/pci_insn.h
+++ b/arch/s390/include/asm/pci_insn.h
@@ -101,6 +101,7 @@ struct zpci_fib {
/* Set Interruption Controls Operation Controls */
#define SIC_IRQ_MODE_ALL 0
#define SIC_IRQ_MODE_SINGLE 1
+#define SIC_SET_AENI_CONTROLS 2
#define SIC_IRQ_MODE_DIRECT 4
#define SIC_IRQ_MODE_D_ALL 16
#define SIC_IRQ_MODE_D_SINGLE 17
@@ -127,9 +128,20 @@ struct zpci_cdiib {
u64 : 64;
} __packed __aligned(8);

+/* adapter interruption parameters block */
+struct zpci_aipb {
+ u64 faisb;
+ u64 gait;
+ u16 : 13;
+ u16 afi : 3;
+ u32 : 32;
+ u16 faal;
+} __packed __aligned(8);
+
union zpci_sic_iib {
struct zpci_diib diib;
struct zpci_cdiib cdiib;
+ struct zpci_aipb aipb;
};

DECLARE_STATIC_KEY_FALSE(have_mio);
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 65e75ca2fc5d..5e638f7c86f8 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -32,6 +32,7 @@
#include "kvm-s390.h"
#include "gaccess.h"
#include "trace-s390.h"
+#include "pci.h"

#define PFAULT_INIT 0x0600
#define PFAULT_DONE 0x0680
@@ -3286,6 +3287,11 @@ void kvm_s390_gib_destroy(void)
{
if (!gib)
return;
+ if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV) && sclp.has_aeni && aift) {
+ mutex_lock(&aift->aift_lock);
+ kvm_s390_pci_aen_exit();
+ mutex_unlock(&aift->aift_lock);
+ }
chsc_sgib(0);
unregister_adapter_interrupt(&gib_alert_irq);
free_page((unsigned long)gib);
@@ -3323,6 +3329,14 @@ int kvm_s390_gib_init(u8 nisc)
goto out_unreg_gal;
}

+ if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV) && sclp.has_aeni) {
+ if (kvm_s390_pci_aen_init(nisc)) {
+ pr_err("Initializing AEN for PCI failed\n");
+ rc = -EIO;
+ goto out_unreg_gal;
+ }
+ }
+
KVM_EVENT(3, "gib 0x%pK (nisc=%d) initialized", gib, gib->nisc);
goto out;

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 577f1ead6a51..dd4f4bfb326b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -48,6 +48,7 @@
#include <asm/fpu/api.h>
#include "kvm-s390.h"
#include "gaccess.h"
+#include "pci.h"

#define CREATE_TRACE_POINTS
#include "trace.h"
@@ -503,6 +504,14 @@ int kvm_arch_init(void *opaque)
goto out;
}

+ if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
+ rc = kvm_s390_pci_init();
+ if (rc) {
+ pr_err("Unable to allocate AIFT for PCI\n");
+ goto out;
+ }
+ }
+
rc = kvm_s390_gib_init(GAL_ISC);
if (rc)
goto out;
diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
index bc1b7a564e55..9b8390133e15 100644
--- a/arch/s390/kvm/pci.c
+++ b/arch/s390/kvm/pci.c
@@ -10,6 +10,148 @@
#include <linux/kvm_host.h>
#include <linux/pci.h>
#include <asm/kvm_pci.h>
+#include <asm/pci.h>
+#include <asm/pci_insn.h>
+#include "pci.h"
+
+struct zpci_aift *aift;
+
+static inline int __set_irq_noiib(u16 ctl, u8 isc)
+{
+ union zpci_sic_iib iib = {{0}};
+
+ return zpci_set_irq_ctrl(ctl, isc, &iib);
+}
+
+/* Caller must hold the aift lock before calling this function */
+void kvm_s390_pci_aen_exit(void)
+{
+ unsigned long flags;
+ struct kvm_zdev **gait_kzdev;
+
+ /*
+ * Contents of the aipb remain registered for the life of the host
+ * kernel, the information preserved in zpci_aipb and zpci_aif_sbv
+ * in case we insert the KVM module again later. Clear the AIFT
+ * information and free anything not registered with underlying
+ * firmware.
+ */
+ spin_lock_irqsave(&aift->gait_lock, flags);
+ gait_kzdev = aift->kzdev;
+ aift->gait = 0;
+ aift->sbv = 0;
+ aift->kzdev = 0;
+ spin_unlock_irqrestore(&aift->gait_lock, flags);
+
+ kfree(gait_kzdev);
+}
+
+static int zpci_setup_aipb(u8 nisc)
+{
+ struct page *page;
+ int size, rc;
+
+ zpci_aipb = kzalloc(sizeof(union zpci_sic_iib), GFP_KERNEL);
+ if (!zpci_aipb)
+ return -ENOMEM;
+
+ aift->sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC, 0);
+ if (!aift->sbv) {
+ rc = -ENOMEM;
+ goto free_aipb;
+ }
+ zpci_aif_sbv = aift->sbv;
+ size = get_order(PAGE_ALIGN(ZPCI_NR_DEVICES *
+ sizeof(struct zpci_gaite)));
+ page = alloc_pages(GFP_KERNEL | __GFP_ZERO, size);
+ if (!page) {
+ rc = -ENOMEM;
+ goto free_sbv;
+ }
+ aift->gait = (struct zpci_gaite *)page_to_phys(page);
+
+ zpci_aipb->aipb.faisb = virt_to_phys(aift->sbv->vector);
+ zpci_aipb->aipb.gait = virt_to_phys(aift->gait);
+ zpci_aipb->aipb.afi = nisc;
+ zpci_aipb->aipb.faal = ZPCI_NR_DEVICES;
+
+ /* Setup Adapter Event Notification Interpretation */
+ if (zpci_set_irq_ctrl(SIC_SET_AENI_CONTROLS, 0, zpci_aipb)) {
+ rc = -EIO;
+ goto free_gait;
+ }
+
+ return 0;
+
+free_gait:
+ size = get_order(PAGE_ALIGN(ZPCI_NR_DEVICES *
+ sizeof(struct zpci_gaite)));
+ free_pages((unsigned long)aift->gait, size);
+free_sbv:
+ airq_iv_release(aift->sbv);
+ zpci_aif_sbv = 0;
+free_aipb:
+ kfree(zpci_aipb);
+ zpci_aipb = 0;
+
+ return rc;
+}
+
+static int zpci_reset_aipb(u8 nisc)
+{
+ /*
+ * AEN registration can only happen once per system boot. If
+ * an aipb already exists then AEN was already registered and
+ * we can re-use the aipb contents. This can only happen if
+ * the KVM module was removed and re-inserted.
+ */
+ if (zpci_aipb->aipb.faal != ZPCI_NR_DEVICES ||
+ zpci_aipb->aipb.afi != nisc) {
+ return -EINVAL;
+ }
+ aift->sbv = zpci_aif_sbv;
+ aift->gait = (struct zpci_gaite *)zpci_aipb->aipb.gait;
+
+ return 0;
+}
+
+int kvm_s390_pci_aen_init(u8 nisc)
+{
+ int rc = 0;
+
+ /* If already enabled for AEN, bail out now */
+ if (aift->gait || aift->sbv)
+ return -EPERM;
+
+ mutex_lock(&aift->aift_lock);
+ aift->kzdev = kcalloc(ZPCI_NR_DEVICES, sizeof(struct kvm_zdev),
+ GFP_KERNEL);
+ if (!aift->kzdev) {
+ rc = -ENOMEM;
+ goto unlock;
+ }
+
+ if (!zpci_aipb)
+ rc = zpci_setup_aipb(nisc);
+ else
+ rc = zpci_reset_aipb(nisc);
+ if (rc)
+ goto free_zdev;
+
+ /* Enable floating IRQs */
+ if (__set_irq_noiib(SIC_IRQ_MODE_SINGLE, nisc)) {
+ rc = -EIO;
+ kvm_s390_pci_aen_exit();
+ }
+
+ goto unlock;
+
+free_zdev:
+ kfree(aift->kzdev);
+unlock:
+ mutex_unlock(&aift->aift_lock);
+ return rc;
+}

int kvm_s390_pci_dev_open(struct zpci_dev *zdev)
{
@@ -36,3 +178,15 @@ void kvm_s390_pci_dev_release(struct zpci_dev *zdev)
kfree(kzdev);
}
EXPORT_SYMBOL_GPL(kvm_s390_pci_dev_release);
+
+int kvm_s390_pci_init(void)
+{
+ aift = kzalloc(sizeof(struct zpci_aift), GFP_KERNEL);
+ if (!aift)
+ return -ENOMEM;
+
+ spin_lock_init(&aift->gait_lock);
+ mutex_init(&aift->aift_lock);
+
+ return 0;
+}
diff --git a/arch/s390/kvm/pci.h b/arch/s390/kvm/pci.h
new file mode 100644
index 000000000000..53e9968707c8
--- /dev/null
+++ b/arch/s390/kvm/pci.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * s390 kvm PCI passthrough support
+ *
+ * Copyright IBM Corp. 2021
+ *
+ * Author(s): Matthew Rosato <[email protected]>
+ */
+
+#ifndef __KVM_S390_PCI_H
+#define __KVM_S390_PCI_H
+
+#include <linux/pci.h>
+#include <linux/mutex.h>
+#include <asm/airq.h>
+#include <asm/kvm_pci.h>
+
+struct zpci_gaite {
+ u32 gisa;
+ u8 gisc;
+ u8 count;
+ u8 reserved;
+ u8 aisbo;
+ u64 aisb;
+};
+
+struct zpci_aift {
+ struct zpci_gaite *gait;
+ struct airq_iv *sbv;
+ struct kvm_zdev **kzdev;
+ spinlock_t gait_lock; /* Protects the gait, used during AEN forward */
+ struct mutex aift_lock; /* Protects the other structures in aift */
+};
+
+extern struct zpci_aift *aift;
+
+int kvm_s390_pci_aen_init(u8 nisc);
+void kvm_s390_pci_aen_exit(void);
+
+int kvm_s390_pci_init(void);
+
+#endif /* __KVM_S390_PCI_H */
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 04c16312ad54..13033717cd4e 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -61,6 +61,12 @@ DEFINE_STATIC_KEY_FALSE(have_mio);

static struct kmem_cache *zdev_fmb_cache;

+/* AEN structures that must be preserved over KVM module re-insertion */
+union zpci_sic_iib *zpci_aipb;
+EXPORT_SYMBOL_GPL(zpci_aipb);
+struct airq_iv *zpci_aif_sbv;
+EXPORT_SYMBOL_GPL(zpci_aif_sbv);
+
struct zpci_dev *get_zdev_by_fid(u32 fid)
{
struct zpci_dev *tmp, *zdev = NULL;
--
2.27.0


2022-02-07 11:30:29

by Matthew Rosato

Subject: [PATCH v3 23/30] KVM: s390: intercept the rpcit instruction

For faster handling of PCI translation refreshes, intercept the RPCIT
instruction in KVM and call the associated handler.

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/kvm/priv.c | 49 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 417154b314a6..d6a5784d8dde 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -29,6 +29,7 @@
#include <asm/ap.h>
#include "gaccess.h"
#include "kvm-s390.h"
+#include "pci.h"
#include "trace.h"

static int handle_ri(struct kvm_vcpu *vcpu)
@@ -335,6 +336,52 @@ static int handle_rrbe(struct kvm_vcpu *vcpu)
return 0;
}

+static int handle_rpcit(struct kvm_vcpu *vcpu)
+{
+ int reg1, reg2;
+ u8 status;
+ int rc;
+
+ if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+ return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
+ /*
+ * If the host doesn't support PCI passthrough interpretation, it must
+ * be an emulated device.
+ */
+ if (!IS_ENABLED(CONFIG_VFIO_PCI_ZDEV))
+ return -EOPNOTSUPP;
+
+ kvm_s390_get_regs_rre(vcpu, &reg1, &reg2);
+
+ /* If the device has a SHM bit on, let userspace take care of this */
+ if (((vcpu->run->s.regs.gprs[reg1] >> 32) & aift->mdd) != 0)
+ return -EOPNOTSUPP;
+
+ rc = kvm_s390_pci_refresh_trans(vcpu, vcpu->run->s.regs.gprs[reg1],
+ vcpu->run->s.regs.gprs[reg2],
+ vcpu->run->s.regs.gprs[reg2 + 1],
+ &status);
+
+ switch (rc) {
+ case 0:
+ kvm_s390_set_psw_cc(vcpu, 0);
+ break;
+ case -EOPNOTSUPP:
+ return -EOPNOTSUPP;
+ default:
+ vcpu->run->s.regs.gprs[reg1] &= 0xffffffff00ffffffUL;
+ vcpu->run->s.regs.gprs[reg1] |= (u64)status << 24;
+ if (status != 0)
+ kvm_s390_set_psw_cc(vcpu, 1);
+ else
+ kvm_s390_set_psw_cc(vcpu, 3);
+ break;
+ }
+
+ return 0;
+}
+
#define SSKE_NQ 0x8
#define SSKE_MR 0x4
#define SSKE_MC 0x2
@@ -1275,6 +1322,8 @@ int kvm_s390_handle_b9(struct kvm_vcpu *vcpu)
return handle_essa(vcpu);
case 0xaf:
return handle_pfmf(vcpu);
+ case 0xd3:
+ return handle_rpcit(vcpu);
default:
return -EOPNOTSUPP;
}
--
2.27.0


2022-02-07 11:38:04

by Matthew Rosato

Subject: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

KVM zPCI passthrough device logic will need a reference to the associated
kvm guest that has access to the device. Let's register a group notifier
for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
an association between a kvm guest and the host zdev.

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/kvm_pci.h | 2 ++
drivers/vfio/pci/vfio_pci_core.c | 2 ++
drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
include/linux/vfio_pci_core.h | 10 +++++++
4 files changed, 60 insertions(+)

diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
index e4696f5592e1..16290b4cf2a6 100644
--- a/arch/s390/include/asm/kvm_pci.h
+++ b/arch/s390/include/asm/kvm_pci.h
@@ -16,6 +16,7 @@
#include <linux/kvm.h>
#include <linux/pci.h>
#include <linux/mutex.h>
+#include <linux/notifier.h>
#include <asm/pci_insn.h>
#include <asm/pci_dma.h>

@@ -32,6 +33,7 @@ struct kvm_zdev {
u64 rpcit_count;
struct kvm_zdev_ioat ioat;
struct zpci_fib fib;
+ struct notifier_block nb;
};

int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f948e6cd2993..fc57d4d0abbe 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)

vfio_pci_vf_token_user_add(vdev, -1);
vfio_spapr_pci_eeh_release(vdev->pdev);
+ vfio_pci_zdev_release(vdev);
vfio_pci_core_disable(vdev);

mutex_lock(&vdev->igate);
@@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
{
vfio_pci_probe_mmaps(vdev);
+ vfio_pci_zdev_open(vdev);
vfio_spapr_pci_eeh_open(vdev->pdev);
vfio_pci_vf_token_user_add(vdev, 1);
}
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index ea4c0d2b0663..9f8284499111 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -13,6 +13,7 @@
#include <linux/vfio_zdev.h>
#include <asm/pci_clp.h>
#include <asm/pci_io.h>
+#include <asm/kvm_pci.h>

#include <linux/vfio_pci_core.h>

@@ -136,3 +137,48 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,

return ret;
}
+
+static int vfio_pci_zdev_group_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct kvm_zdev *kzdev = container_of(nb, struct kvm_zdev, nb);
+
+ if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
+ if (!data || !kzdev->zdev)
+ return NOTIFY_DONE;
+ kzdev->kvm = data;
+ }
+
+ return NOTIFY_OK;
+}
+
+void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
+{
+ unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
+ struct zpci_dev *zdev = to_zpci(vdev->pdev);
+
+ if (!zdev)
+ return;
+
+ if (kvm_s390_pci_dev_open(zdev))
+ return;
+
+ zdev->kzdev->nb.notifier_call = vfio_pci_zdev_group_notifier;
+
+ if (vfio_register_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
+ &events, &zdev->kzdev->nb))
+ kvm_s390_pci_dev_release(zdev);
+}
+
+void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
+{
+ struct zpci_dev *zdev = to_zpci(vdev->pdev);
+
+ if (!zdev || !zdev->kzdev)
+ return;
+
+ vfio_unregister_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
+ &zdev->kzdev->nb);
+
+ kvm_s390_pci_dev_release(zdev);
+}
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 5e2bca3b89db..05287f8ac855 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -198,12 +198,22 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
#ifdef CONFIG_VFIO_PCI_ZDEV
extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
struct vfio_info_cap *caps);
+void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev);
+void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev);
#else
static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
struct vfio_info_cap *caps)
{
return -ENODEV;
}
+
+static inline void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
+{
+}
+
+static inline void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
+{
+}
#endif

/* Will be exported for vfio pci drivers usage */
--
2.27.0


2022-02-07 12:05:25

by Matthew Rosato

Subject: [PATCH v3 29/30] KVM: s390: introduce CPU feature for zPCI Interpretation

KVM_S390_VM_CPU_FEAT_ZPCI_INTERP relays whether zPCI interpretive
execution is possible based on the available hardware facilities.
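
For orientation, a userspace sketch of how the feature could be queried
(not taken from this series; the byte-wise test assumes the MSB-first bit
numbering that the kernel's set_bit_inv() gives the feat bitmap):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch only; vm_fd is an open KVM VM file descriptor. */
static int vm_has_zpci_interp(int vm_fd)
{
        struct kvm_s390_vm_cpu_feat feat = {};
        struct kvm_device_attr attr = {
                .group = KVM_S390_VM_CPU_MODEL,
                .attr = KVM_S390_VM_CPU_MACHINE_FEAT,
                .addr = (__u64)(unsigned long)&feat,
        };
        const unsigned char *bytes = (const unsigned char *)feat.feat;

        if (ioctl(vm_fd, KVM_GET_DEVICE_ATTR, &attr))
                return 0;

        return bytes[KVM_S390_VM_CPU_FEAT_ZPCI_INTERP / 8] &
               (0x80 >> (KVM_S390_VM_CPU_FEAT_ZPCI_INTERP % 8));
}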

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/uapi/asm/kvm.h | 1 +
arch/s390/kvm/kvm-s390.c | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
index 7a6b14874d65..ed06458a871f 100644
--- a/arch/s390/include/uapi/asm/kvm.h
+++ b/arch/s390/include/uapi/asm/kvm.h
@@ -130,6 +130,7 @@ struct kvm_s390_vm_cpu_machine {
#define KVM_S390_VM_CPU_FEAT_PFMFI 11
#define KVM_S390_VM_CPU_FEAT_SIGPIF 12
#define KVM_S390_VM_CPU_FEAT_KSS 13
+#define KVM_S390_VM_CPU_FEAT_ZPCI_INTERP 14
struct kvm_s390_vm_cpu_feat {
__u64 feat[16];
};
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 208b09d08385..e20d9ac1935d 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -434,6 +434,11 @@ static void kvm_s390_cpu_feat_init(void)
if (test_facility(151)) /* DFLTCC */
__insn32_query(INSN_DFLTCC, kvm_s390_available_subfunc.dfltcc);

+ /* zPCI Interpretation */
+ if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV) && test_facility(69) &&
+ test_facility(70) && test_facility(71) && test_facility(72))
+ allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ZPCI_INTERP);
+
if (MACHINE_HAS_ESOP)
allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ESOP);
/*
--
2.27.0


2022-02-07 15:08:57

by Matthew Rosato

Subject: [PATCH v3 08/30] s390/pci: stash associated GISA designation

For passthrough devices, we will need to know the GISA designation of the
guest if interpretation facilities are to be used. Set up to stash this in
the zdev and set a default of 0 (no GISA designation) for now; a subsequent
patch will set a valid GISA designation for passthrough devices.
Also, extend mpcific routines to specify this stashed designation as part
of the mpcific command.

Reviewed-by: Niklas Schnelle <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: Eric Farman <[email protected]>
Reviewed-by: Pierre Morel <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/pci.h | 1 +
arch/s390/include/asm/pci_clp.h | 3 ++-
arch/s390/pci/pci.c | 6 ++++++
arch/s390/pci/pci_clp.c | 1 +
arch/s390/pci/pci_irq.c | 5 +++++
5 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 90824be5ce9a..d07d7c3205de 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -123,6 +123,7 @@ struct zpci_dev {
enum zpci_state state;
u32 fid; /* function ID, used by sclp */
u32 fh; /* function handle, used by insn's */
+ u32 gisa; /* GISA designation for passthrough */
u16 vfn; /* virtual function number */
u16 pchid; /* physical channel ID */
u8 pfgid; /* function group ID */
diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
index 1f4b666e85ee..f3286bc5ba6e 100644
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -173,7 +173,8 @@ struct clp_req_set_pci {
u16 reserved2;
u8 oc; /* operation controls */
u8 ndas; /* number of dma spaces */
- u64 reserved3;
+ u32 reserved3;
+ u32 gisa; /* GISA designation */
} __packed;

/* Set PCI function response */
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 792f8e0f2178..ca9c29386de6 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -119,6 +119,7 @@ int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
fib.pba = base;
fib.pal = limit;
fib.iota = iota | ZPCI_IOTA_RTTO_FLAG;
+ fib.gd = zdev->gisa;
cc = zpci_mod_fc(req, &fib, &status);
if (cc)
zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
@@ -132,6 +133,8 @@ int zpci_unregister_ioat(struct zpci_dev *zdev, u8 dmaas)
struct zpci_fib fib = {0};
u8 cc, status;

+ fib.gd = zdev->gisa;
+
cc = zpci_mod_fc(req, &fib, &status);
if (cc)
zpci_dbg(3, "unreg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
@@ -159,6 +162,7 @@ int zpci_fmb_enable_device(struct zpci_dev *zdev)
atomic64_set(&zdev->unmapped_pages, 0);

fib.fmb_addr = virt_to_phys(zdev->fmb);
+ fib.gd = zdev->gisa;
cc = zpci_mod_fc(req, &fib, &status);
if (cc) {
kmem_cache_free(zdev_fmb_cache, zdev->fmb);
@@ -177,6 +181,8 @@ int zpci_fmb_disable_device(struct zpci_dev *zdev)
if (!zdev->fmb)
return -EINVAL;

+ fib.gd = zdev->gisa;
+
/* Function measurement is disabled if fmb address is zero */
cc = zpci_mod_fc(req, &fib, &status);
if (cc == 3) /* Function already gone. */
diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index be077b39da33..4dcc37ddeeaf 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -240,6 +240,7 @@ static int clp_set_pci_fn(struct zpci_dev *zdev, u32 *fh, u8 nr_dma_as, u8 comma
rrb->request.fh = zdev->fh;
rrb->request.oc = command;
rrb->request.ndas = nr_dma_as;
+ rrb->request.gisa = zdev->gisa;

rc = clp_req(rrb, CLP_LPS_PCI);
if (rrb->response.hdr.rsp == CLP_RC_SETPCIFN_BUSY) {
diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
index 2f675355fd0c..a19ac0282929 100644
--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -43,6 +43,7 @@ static int zpci_set_airq(struct zpci_dev *zdev)
fib.fmt0.aibvo = 0; /* each zdev has its own interrupt vector */
fib.fmt0.aisb = virt_to_phys(zpci_sbv->vector) + (zdev->aisb / 64) * 8;
fib.fmt0.aisbo = zdev->aisb & 63;
+ fib.gd = zdev->gisa;

return zpci_mod_fc(req, &fib, &status) ? -EIO : 0;
}
@@ -54,6 +55,8 @@ static int zpci_clear_airq(struct zpci_dev *zdev)
struct zpci_fib fib = {0};
u8 cc, status;

+ fib.gd = zdev->gisa;
+
cc = zpci_mod_fc(req, &fib, &status);
if (cc == 3 || (cc == 1 && status == 24))
/* Function already gone or IRQs already deregistered. */
@@ -72,6 +75,7 @@ static int zpci_set_directed_irq(struct zpci_dev *zdev)
fib.fmt = 1;
fib.fmt1.noi = zdev->msi_nr_irqs;
fib.fmt1.dibvo = zdev->msi_first_bit;
+ fib.gd = zdev->gisa;

return zpci_mod_fc(req, &fib, &status) ? -EIO : 0;
}
@@ -84,6 +88,7 @@ static int zpci_clear_directed_irq(struct zpci_dev *zdev)
u8 cc, status;

fib.fmt = 1;
+ fib.gd = zdev->gisa;
cc = zpci_mod_fc(req, &fib, &status);
if (cc == 3 || (cc == 1 && status == 24))
/* Function already gone or IRQs already deregistered. */
--
2.27.0


2022-02-07 17:29:01

by Matthew Rosato

Subject: [PATCH v3 19/30] KVM: s390: pci: provide routines for enabling/disabling interpretation

These routines will be wired into the vfio_pci_zdev ioctl handlers to
respond to requests to enable / disable a device for zPCI Load/Store
interpretation.

The first time such a request is received, enable the necessary facilities
for the guest.
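
As a rough sketch of how the (later) vfio-pci-zdev feature handlers are
expected to drive these routines (the wrapper name and the enable flag
are assumptions for illustration only):

/* Illustrative wrapper; the real wiring lands later in the series. */
static int zdev_feat_interp(struct zpci_dev *zdev, bool enable)
{
        int rc;

        /* Validate facilities and the KVM association first. */
        rc = kvm_s390_pci_interp_probe(zdev);
        if (rc)
                return rc;

        if (enable)
                return kvm_s390_pci_interp_enable(zdev);

        return kvm_s390_pci_interp_disable(zdev);
}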

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/kvm_pci.h | 4 ++
arch/s390/kvm/pci.c | 102 ++++++++++++++++++++++++++++++++
arch/s390/pci/pci.c | 3 +
3 files changed, 109 insertions(+)

diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
index ef10f9e46e37..422701d526dd 100644
--- a/arch/s390/include/asm/kvm_pci.h
+++ b/arch/s390/include/asm/kvm_pci.h
@@ -24,4 +24,8 @@ struct kvm_zdev {
int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
void kvm_s390_pci_dev_release(struct zpci_dev *zdev);

+int kvm_s390_pci_interp_probe(struct zpci_dev *zdev);
+int kvm_s390_pci_interp_enable(struct zpci_dev *zdev);
+int kvm_s390_pci_interp_disable(struct zpci_dev *zdev);
+
#endif /* ASM_KVM_PCI_H */
diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
index 9b8390133e15..16bef3935284 100644
--- a/arch/s390/kvm/pci.c
+++ b/arch/s390/kvm/pci.c
@@ -12,7 +12,9 @@
#include <asm/kvm_pci.h>
#include <asm/pci.h>
#include <asm/pci_insn.h>
+#include <asm/sclp.h>
#include "pci.h"
+#include "kvm-s390.h"

struct zpci_aift *aift;

@@ -153,6 +155,106 @@ int kvm_s390_pci_aen_init(u8 nisc)
return rc;
}

+int kvm_s390_pci_interp_probe(struct zpci_dev *zdev)
+{
+ /* Must have appropriate hardware facilities */
+ if (!sclp.has_zpci_lsi || !sclp.has_aisii || !sclp.has_aeni ||
+ !sclp.has_aisi || !test_facility(69) || !test_facility(70) ||
+ !test_facility(71) || !test_facility(72)) {
+ return -EINVAL;
+ }
+
+ /* Must have a KVM association registered */
+ if (!zdev->kzdev || !zdev->kzdev->kvm)
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_probe);
+
+int kvm_s390_pci_interp_enable(struct zpci_dev *zdev)
+{
+ u32 gisa;
+ int rc;
+
+ if (!zdev->kzdev || !zdev->kzdev->kvm)
+ return -EINVAL;
+
+ /*
+ * If this is the first request to use an interpreted device, make the
+ * necessary vcpu changes
+ */
+ if (!zdev->kzdev->kvm->arch.use_zpci_interp)
+ kvm_s390_vcpu_pci_enable_interp(zdev->kzdev->kvm);
+
+ /*
+ * In the event of a system reset in userspace, the GISA designation
+ * may still be assigned because the device is still enabled.
+ * Verify it's the same guest before proceeding.
+ */
+ gisa = (u32)virt_to_phys(&zdev->kzdev->kvm->arch.sie_page2->gisa);
+ if (zdev->gisa != 0 && zdev->gisa != gisa)
+ return -EPERM;
+
+ if (zdev_enabled(zdev)) {
+ zdev->gisa = 0;
+ rc = zpci_disable_device(zdev);
+ if (rc)
+ return rc;
+ }
+
+ /*
+ * Store information about the identity of the kvm guest allowed to
+ * access this device via interpretation to be used by host CLP
+ */
+ zdev->gisa = gisa;
+
+ rc = zpci_enable_device(zdev);
+ if (rc)
+ goto err;
+
+ /* Re-register the IOMMU that was already created */
+ rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
+ virt_to_phys(zdev->dma_table));
+ if (rc)
+ goto err;
+
+ return rc;
+
+err:
+ zdev->gisa = 0;
+ return rc;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_enable);
+
+int kvm_s390_pci_interp_disable(struct zpci_dev *zdev)
+{
+ int rc;
+
+ if (zdev->gisa == 0)
+ return -EINVAL;
+
+ /* Remove the host CLP guest designation */
+ zdev->gisa = 0;
+
+ if (zdev_enabled(zdev)) {
+ rc = zpci_disable_device(zdev);
+ if (rc)
+ return rc;
+ }
+
+ rc = zpci_enable_device(zdev);
+ if (rc)
+ return rc;
+
+ /* Re-register the IOMMU that was already created */
+ rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
+ virt_to_phys(zdev->dma_table));
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_disable);
+
int kvm_s390_pci_dev_open(struct zpci_dev *zdev)
{
struct kvm_zdev *kzdev;
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 13033717cd4e..5dbe49ec325e 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -147,6 +147,7 @@ int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
return cc;
}
+EXPORT_SYMBOL_GPL(zpci_register_ioat);

/* Modify PCI: Unregister I/O address translation parameters */
int zpci_unregister_ioat(struct zpci_dev *zdev, u8 dmaas)
@@ -727,6 +728,7 @@ int zpci_enable_device(struct zpci_dev *zdev)
zpci_update_fh(zdev, fh);
return rc;
}
+EXPORT_SYMBOL_GPL(zpci_enable_device);

int zpci_disable_device(struct zpci_dev *zdev)
{
@@ -750,6 +752,7 @@ int zpci_disable_device(struct zpci_dev *zdev)
}
return rc;
}
+EXPORT_SYMBOL_GPL(zpci_disable_device);

/**
* zpci_hot_reset_device - perform a reset of the given zPCI function
--
2.27.0


2022-02-07 18:10:56

by Matthew Rosato

Subject: [PATCH v3 01/30] s390/sclp: detect the zPCI load/store interpretation facility

Detect the zPCI Load/Store Interpretation facility.

Reviewed-by: Eric Farman <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/sclp.h | 1 +
drivers/s390/char/sclp_early.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index c68ea35de498..58a4d3d354b7 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -88,6 +88,7 @@ struct sclp_info {
unsigned char has_diag318 : 1;
unsigned char has_sipl : 1;
unsigned char has_dirq : 1;
+ unsigned char has_zpci_lsi : 1;
unsigned int ibc;
unsigned int mtid;
unsigned int mtid_cp;
diff --git a/drivers/s390/char/sclp_early.c b/drivers/s390/char/sclp_early.c
index e9943a86c361..b88dd0da1231 100644
--- a/drivers/s390/char/sclp_early.c
+++ b/drivers/s390/char/sclp_early.c
@@ -45,6 +45,7 @@ static void __init sclp_early_facilities_detect(void)
sclp.has_gisaf = !!(sccb->fac118 & 0x08);
sclp.has_hvs = !!(sccb->fac119 & 0x80);
sclp.has_kss = !!(sccb->fac98 & 0x01);
+ sclp.has_zpci_lsi = !!(sccb->fac118 & 0x01);
if (sccb->fac85 & 0x02)
S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
if (sccb->fac91 & 0x40)
--
2.27.0


2022-02-07 19:28:08

by Matthew Rosato

Subject: [PATCH v3 21/30] KVM: s390: pci: provide routines for enabling/disabling IOAT assist

These routines will be wired into the vfio_pci_zdev ioctl handlers to
respond to requests to enable / disable a device for PCI I/O Address
Translation assistance.
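
For orientation, a hedged sketch of the expected call flow from a (later)
feature handler; the wrapper name and the origin of the guest IOTA value
are assumptions for illustration only:

/* Illustrative wrapper; the real wiring lands later in the series. */
static int zdev_feat_ioat(struct zpci_dev *zdev, bool enable, u64 guest_iota)
{
        int rc;

        rc = kvm_s390_pci_ioat_probe(zdev);
        if (rc)
                return rc;

        if (enable)
                /* guest_iota: guest root-table origin plus designation flags */
                return kvm_s390_pci_ioat_enable(zdev, guest_iota);

        return kvm_s390_pci_ioat_disable(zdev);
}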

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/kvm_pci.h | 15 ++++
arch/s390/include/asm/pci_dma.h | 2 +
arch/s390/kvm/pci.c | 142 ++++++++++++++++++++++++++++++++
arch/s390/kvm/pci.h | 2 +
4 files changed, 161 insertions(+)

diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
index 377eebcba2d1..370d6ba041fe 100644
--- a/arch/s390/include/asm/kvm_pci.h
+++ b/arch/s390/include/asm/kvm_pci.h
@@ -15,11 +15,21 @@
#include <linux/kvm_host.h>
#include <linux/kvm.h>
#include <linux/pci.h>
+#include <linux/mutex.h>
#include <asm/pci_insn.h>
+#include <asm/pci_dma.h>
+
+struct kvm_zdev_ioat {
+ unsigned long *head[ZPCI_TABLE_PAGES];
+ unsigned long **seg;
+ unsigned long ***pt;
+ struct mutex ioat_lock; /* Must be held when modifying head/seg/pt */
+};

struct kvm_zdev {
struct zpci_dev *zdev;
struct kvm *kvm;
+ struct kvm_zdev_ioat ioat;
struct zpci_fib fib;
};

@@ -31,6 +41,11 @@ int kvm_s390_pci_aif_enable(struct zpci_dev *zdev, struct zpci_fib *fib,
bool assist);
int kvm_s390_pci_aif_disable(struct zpci_dev *zdev, bool force);

+int kvm_s390_pci_ioat_probe(struct zpci_dev *zdev);
+int kvm_s390_pci_ioat_enable(struct zpci_dev *zdev, u64 iota);
+int kvm_s390_pci_ioat_disable(struct zpci_dev *zdev);
+u8 kvm_s390_pci_get_dtsm(struct zpci_dev *zdev);
+
int kvm_s390_pci_interp_probe(struct zpci_dev *zdev);
int kvm_s390_pci_interp_enable(struct zpci_dev *zdev);
int kvm_s390_pci_interp_disable(struct zpci_dev *zdev, bool force);
diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h
index 91e63426bdc5..69e616d0712c 100644
--- a/arch/s390/include/asm/pci_dma.h
+++ b/arch/s390/include/asm/pci_dma.h
@@ -50,6 +50,8 @@ enum zpci_ioat_dtype {
#define ZPCI_TABLE_ALIGN ZPCI_TABLE_SIZE
#define ZPCI_TABLE_ENTRY_SIZE (sizeof(unsigned long))
#define ZPCI_TABLE_ENTRIES (ZPCI_TABLE_SIZE / ZPCI_TABLE_ENTRY_SIZE)
+#define ZPCI_TABLE_PAGES (ZPCI_TABLE_SIZE >> PAGE_SHIFT)
+#define ZPCI_TABLE_ENTRIES_PAGES (ZPCI_TABLE_ENTRIES * ZPCI_TABLE_PAGES)

#define ZPCI_TABLE_BITS 11
#define ZPCI_PT_BITS 8
diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
index 8e9044a7bce7..fa9d6ac1755b 100644
--- a/arch/s390/kvm/pci.c
+++ b/arch/s390/kvm/pci.c
@@ -13,12 +13,15 @@
#include <asm/pci.h>
#include <asm/pci_insn.h>
#include <asm/pci_io.h>
+#include <asm/pci_dma.h>
#include <asm/sclp.h>
#include "pci.h"
#include "kvm-s390.h"

struct zpci_aift *aift;

+#define shadow_ioat_init zdev->kzdev->ioat.head[0]
+
static inline int __set_irq_noiib(u16 ctl, u8 isc)
{
union zpci_sic_iib iib = {{0}};
@@ -366,6 +369,138 @@ int kvm_s390_pci_aif_disable(struct zpci_dev *zdev, bool force)
}
EXPORT_SYMBOL_GPL(kvm_s390_pci_aif_disable);

+int kvm_s390_pci_ioat_probe(struct zpci_dev *zdev)
+{
+ /*
+ * Using the IOAT assist is only valid for a device with a KVM guest
+ * registered
+ */
+ if (!zdev->kzdev->kvm)
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_ioat_probe);
+
+int kvm_s390_pci_ioat_enable(struct zpci_dev *zdev, u64 iota)
+{
+ gpa_t gpa = (gpa_t)(iota & ZPCI_RTE_ADDR_MASK);
+ struct kvm_zdev_ioat *ioat;
+ struct page *page;
+ struct kvm *kvm;
+ unsigned int idx;
+ void *iaddr;
+ int i, rc;
+
+ if (shadow_ioat_init)
+ return -EINVAL;
+
+ /* Ensure supported type specified */
+ if ((iota & ZPCI_IOTA_RTTO_FLAG) != ZPCI_IOTA_RTTO_FLAG)
+ return -EINVAL;
+
+ kvm = zdev->kzdev->kvm;
+ ioat = &zdev->kzdev->ioat;
+ mutex_lock(&ioat->ioat_lock);
+ idx = srcu_read_lock(&kvm->srcu);
+ for (i = 0; i < ZPCI_TABLE_PAGES; i++) {
+ page = gfn_to_page(kvm, gpa_to_gfn(gpa));
+ if (is_error_page(page)) {
+ srcu_read_unlock(&kvm->srcu, idx);
+ rc = -EIO;
+ goto unpin;
+ }
+ iaddr = page_to_virt(page) + (gpa & ~PAGE_MASK);
+ ioat->head[i] = (unsigned long *)iaddr;
+ gpa += PAGE_SIZE;
+ }
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ ioat->seg = kcalloc(ZPCI_TABLE_ENTRIES_PAGES, sizeof(unsigned long *),
+ GFP_KERNEL);
+ if (!ioat->seg)
+ goto unpin;
+ ioat->pt = kcalloc(ZPCI_TABLE_ENTRIES, sizeof(unsigned long **),
+ GFP_KERNEL);
+ if (!ioat->pt)
+ goto free_seg;
+
+ mutex_unlock(&ioat->ioat_lock);
+ return 0;
+
+free_seg:
+ kfree(ioat->seg);
+ rc = -ENOMEM;
+unpin:
+ for (i = 0; i < ZPCI_TABLE_PAGES; i++) {
+ kvm_release_pfn_dirty((u64)ioat->head[i] >> PAGE_SHIFT);
+ ioat->head[i] = 0;
+ }
+ mutex_unlock(&ioat->ioat_lock);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_ioat_enable);
+
+static void free_pt_entry(struct kvm_zdev_ioat *ioat, int st, int pt)
+{
+ if (!ioat->pt[st][pt])
+ return;
+
+ kvm_release_pfn_dirty((u64)ioat->pt[st][pt]);
+}
+
+static void free_seg_entry(struct kvm_zdev_ioat *ioat, int entry)
+{
+ int i, st, count = 0;
+
+ for (i = 0; i < ZPCI_TABLE_PAGES; i++) {
+ if (ioat->seg[entry + i]) {
+ kvm_release_pfn_dirty((u64)ioat->seg[entry + i]);
+ count++;
+ }
+ }
+
+ if (count == 0)
+ return;
+
+ st = entry / ZPCI_TABLE_PAGES;
+ for (i = 0; i < ZPCI_TABLE_ENTRIES; i++)
+ free_pt_entry(ioat, st, i);
+ kfree(ioat->pt[st]);
+}
+
+int kvm_s390_pci_ioat_disable(struct zpci_dev *zdev)
+{
+ struct kvm_zdev_ioat *ioat;
+ int i;
+
+ if (!shadow_ioat_init)
+ return -EINVAL;
+
+ ioat = &zdev->kzdev->ioat;
+ mutex_lock(&ioat->ioat_lock);
+ for (i = 0; i < ZPCI_TABLE_PAGES; i++) {
+ kvm_release_pfn_dirty((u64)ioat->head[i] >> PAGE_SHIFT);
+ ioat->head[i] = 0;
+ }
+
+ for (i = 0; i < ZPCI_TABLE_ENTRIES_PAGES; i += ZPCI_TABLE_PAGES)
+ free_seg_entry(ioat, i);
+
+ kfree(ioat->seg);
+ kfree(ioat->pt);
+ mutex_unlock(&ioat->ioat_lock);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_ioat_disable);
+
+u8 kvm_s390_pci_get_dtsm(struct zpci_dev *zdev)
+{
+ return (zdev->dtsm & KVM_S390_PCI_DTSM_MASK);
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_get_dtsm);
+
int kvm_s390_pci_interp_probe(struct zpci_dev *zdev)
{
/* Must have appropriate hardware facilities */
@@ -449,6 +584,10 @@ int kvm_s390_pci_interp_disable(struct zpci_dev *zdev, bool force)
if (zdev->kzdev->fib.fmt0.aibv != 0)
kvm_s390_pci_aif_disable(zdev, force);

+ /* If we are using the IOAT assist, disable it now */
+ if (zdev->kzdev->ioat.head[0])
+ kvm_s390_pci_ioat_disable(zdev);
+
/* Remove the host CLP guest designation */
zdev->gisa = 0;

@@ -478,6 +617,8 @@ int kvm_s390_pci_dev_open(struct zpci_dev *zdev)
if (!kzdev)
return -ENOMEM;

+ mutex_init(&kzdev->ioat.ioat_lock);
+
kzdev->zdev = zdev;
zdev->kzdev = kzdev;

@@ -492,6 +633,7 @@ void kvm_s390_pci_dev_release(struct zpci_dev *zdev)
kzdev = zdev->kzdev;
WARN_ON(kzdev->zdev != zdev);
zdev->kzdev = 0;
+ mutex_destroy(&kzdev->ioat.ioat_lock);
kfree(kzdev);
}
EXPORT_SYMBOL_GPL(kvm_s390_pci_dev_release);
diff --git a/arch/s390/kvm/pci.h b/arch/s390/kvm/pci.h
index 4d3db58beb74..0fa0e3d61aaa 100644
--- a/arch/s390/kvm/pci.h
+++ b/arch/s390/kvm/pci.h
@@ -16,6 +16,8 @@
#include <asm/airq.h>
#include <asm/kvm_pci.h>

+#define KVM_S390_PCI_DTSM_MASK 0x40
+
struct zpci_gaite {
u32 gisa;
u8 gisc;
--
2.27.0


2022-02-07 20:02:54

by Matthew Rosato

Subject: [PATCH v3 04/30] s390/sclp: detect the AISI facility

Detect the Adapter Interruption Suppression Interpretation facility.

Reviewed-by: Eric Farman <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/sclp.h | 1 +
drivers/s390/char/sclp_early.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index 8c2e142000d4..33b174007848 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -91,6 +91,7 @@ struct sclp_info {
unsigned char has_zpci_lsi : 1;
unsigned char has_aisii : 1;
unsigned char has_aeni : 1;
+ unsigned char has_aisi : 1;
unsigned int ibc;
unsigned int mtid;
unsigned int mtid_cp;
diff --git a/drivers/s390/char/sclp_early.c b/drivers/s390/char/sclp_early.c
index e9af01b4c97a..c13e55cc4a5d 100644
--- a/drivers/s390/char/sclp_early.c
+++ b/drivers/s390/char/sclp_early.c
@@ -47,6 +47,7 @@ static void __init sclp_early_facilities_detect(void)
sclp.has_kss = !!(sccb->fac98 & 0x01);
sclp.has_aisii = !!(sccb->fac118 & 0x40);
sclp.has_aeni = !!(sccb->fac118 & 0x20);
+ sclp.has_aisi = !!(sccb->fac118 & 0x10);
sclp.has_zpci_lsi = !!(sccb->fac118 & 0x01);
if (sccb->fac85 & 0x02)
S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
--
2.27.0


2022-02-07 21:43:17

by Cornelia Huck

Subject: Re: [PATCH v3 06/30] s390/airq: allow for airq structure that uses an input vector

On Fri, Feb 04 2022, Matthew Rosato <[email protected]> wrote:

> When doing device passthrough where interrupts are being forwarded from
> host to guest, we wish to use a pinned section of guest memory as the
> vector (the same memory used by the guest as the vector). To accomplish
> this, add a new parameter for airq_iv_create which allows passing an
> existing vector to be used instead of allocating a new one. The caller
> is responsible for ensuring the vector is pinned in memory as well as for
> unpinning the memory when the vector is no longer needed.
>
> A subsequent patch will use this new parameter for zPCI interpretation.
>
> Reviewed-by: Pierre Morel <[email protected]>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> arch/s390/include/asm/airq.h | 4 +++-
> arch/s390/pci/pci_irq.c | 8 ++++----
> drivers/s390/cio/airq.c | 10 +++++++---
> drivers/s390/virtio/virtio_ccw.c | 2 +-
> 4 files changed, 15 insertions(+), 9 deletions(-)

For virtio-ccw:

Acked-by: Cornelia Huck <[email protected]>


2022-02-08 14:39:00

by Pierre Morel

Subject: Re: [PATCH v3 12/30] s390/pci: get SHM information from list pci



On 2/4/22 22:15, Matthew Rosato wrote:
> KVM will need information on the special handle mask used to indicate
> emulated devices. In order to obtain this, a new type of list pci call
> must be made to gather the information. Extend clp_list_pci_req to
> also fetch the model-dependent-data field that holds this mask.
>
> Reviewed-by: Niklas Schnelle <[email protected]>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> arch/s390/include/asm/pci.h | 1 +
> arch/s390/include/asm/pci_clp.h | 2 +-
> arch/s390/pci/pci_clp.c | 25 ++++++++++++++++++++++---
> 3 files changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 3c0b9986dcdc..e8a3fd5bc169 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -227,6 +227,7 @@ int clp_enable_fh(struct zpci_dev *zdev, u32 *fh, u8 nr_dma_as);
> int clp_disable_fh(struct zpci_dev *zdev, u32 *fh);
> int clp_get_state(u32 fid, enum zpci_state *state);
> int clp_refresh_fh(u32 fid, u32 *fh);
> +int zpci_get_mdd(u32 *mdd);
>
> /* UID */
> void update_uid_checking(bool new);
> diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
> index d6189ed14f84..dc2041e97de4 100644
> --- a/arch/s390/include/asm/pci_clp.h
> +++ b/arch/s390/include/asm/pci_clp.h
> @@ -76,7 +76,7 @@ struct clp_req_list_pci {
> struct clp_rsp_list_pci {
> struct clp_rsp_hdr hdr;
> u64 resume_token;
> - u32 reserved2;
> + u32 mdd;
> u16 max_fn;
> u8 : 7;
> u8 uid_checking : 1;
> diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
> index dc733b58e74f..7477956be632 100644
> --- a/arch/s390/pci/pci_clp.c
> +++ b/arch/s390/pci/pci_clp.c
> @@ -328,7 +328,7 @@ int clp_disable_fh(struct zpci_dev *zdev, u32 *fh)
> }
>
> static int clp_list_pci_req(struct clp_req_rsp_list_pci *rrb,
> - u64 *resume_token, int *nentries)
> + u64 *resume_token, int *nentries, u32 *mdd)
> {
> int rc;
>
> @@ -354,6 +354,8 @@ static int clp_list_pci_req(struct clp_req_rsp_list_pci *rrb,
> *nentries = (rrb->response.hdr.len - LIST_PCI_HDR_LEN) /
> rrb->response.entry_size;
> *resume_token = rrb->response.resume_token;
> + if (mdd)
> + *mdd = rrb->response.mdd;

mdd is a central value for zPCI, like zpci_unique_uid; I think both
should be treated the same way, i.e. either belong to a central zPCI
structure or, as is done today with zpci_unique_uid, be a global.
An advantage of that design is that the clp_list_pci_req signature does
not need to be modified and we do not need an extra clp call.

However, I come late with this request, so I guess this simplification
can be done later.

Acked-by: Pierre Morel <[email protected]>


>
> return rc;
> }
> @@ -365,7 +367,7 @@ static int clp_list_pci(struct clp_req_rsp_list_pci *rrb, void *data,
> int nentries, i, rc;
>
> do {
> - rc = clp_list_pci_req(rrb, &resume_token, &nentries);
> + rc = clp_list_pci_req(rrb, &resume_token, &nentries, NULL);
> if (rc)
> return rc;
> for (i = 0; i < nentries; i++)
> @@ -383,7 +385,7 @@ static int clp_find_pci(struct clp_req_rsp_list_pci *rrb, u32 fid,
> int nentries, i, rc;
>
> do {
> - rc = clp_list_pci_req(rrb, &resume_token, &nentries);
> + rc = clp_list_pci_req(rrb, &resume_token, &nentries, NULL);
> if (rc)
> return rc;
> fh_list = rrb->response.fh_list;
> @@ -468,6 +470,23 @@ int clp_get_state(u32 fid, enum zpci_state *state)
> return rc;
> }
>
> +int zpci_get_mdd(u32 *mdd)
> +{
> + struct clp_req_rsp_list_pci *rrb;
> + u64 resume_token = 0;
> + int nentries, rc;
> +
> + rrb = clp_alloc_block(GFP_KERNEL);
> + if (!rrb)
> + return -ENOMEM;
> +
> + rc = clp_list_pci_req(rrb, &resume_token, &nentries, mdd);
> +
> + clp_free_block(rrb);
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(zpci_get_mdd);
> +
> static int clp_base_slpc(struct clp_req *req, struct clp_req_rsp_slpc *lpcb)
> {
> unsigned long limit = PAGE_SIZE - sizeof(lpcb->request);
>

--
Pierre Morel
IBM Lab Boeblingen
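
A minimal sketch of the alternative Pierre describes above -- treating mdd
like zpci_unique_uid -- with hypothetical names (zpci_mdd, zpci_stash_mdd);
the series as posted keeps the extra clp_list_pci_req parameter instead:

/* Sketch only: expose mdd as a global filled in once from the list-pci
 * response, mirroring zpci_unique_uid, so callers need neither an extra
 * parameter nor an extra clp call.
 */
u32 zpci_mdd;

static void zpci_stash_mdd(struct clp_rsp_list_pci *rsp)
{
	/* called from the existing list-pci response handling path */
	zpci_mdd = rsp->mdd;
}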

2022-02-08 18:52:21

by Pierre Morel

[permalink] [raw]
Subject: Re: [PATCH v3 29/30] KVM: s390: introduce CPU feature for zPCI Interpretation



On 2/4/22 22:15, Matthew Rosato wrote:
> KVM_S390_VM_CPU_FEAT_ZPCI_INTERP relays whether zPCI interpretive
> execution is possible based on the available hardware facilities.
>
> Signed-off-by: Matthew Rosato <[email protected]>

Reviewed-by: Pierre Morel <[email protected]>



> ---
> arch/s390/include/uapi/asm/kvm.h | 1 +
> arch/s390/kvm/kvm-s390.c | 5 +++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
> index 7a6b14874d65..ed06458a871f 100644
> --- a/arch/s390/include/uapi/asm/kvm.h
> +++ b/arch/s390/include/uapi/asm/kvm.h
> @@ -130,6 +130,7 @@ struct kvm_s390_vm_cpu_machine {
> #define KVM_S390_VM_CPU_FEAT_PFMFI 11
> #define KVM_S390_VM_CPU_FEAT_SIGPIF 12
> #define KVM_S390_VM_CPU_FEAT_KSS 13
> +#define KVM_S390_VM_CPU_FEAT_ZPCI_INTERP 14
> struct kvm_s390_vm_cpu_feat {
> __u64 feat[16];
> };
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 208b09d08385..e20d9ac1935d 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -434,6 +434,11 @@ static void kvm_s390_cpu_feat_init(void)
> if (test_facility(151)) /* DFLTCC */
> __insn32_query(INSN_DFLTCC, kvm_s390_available_subfunc.dfltcc);
>
> + /* zPCI Interpretation */
> + if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV) && test_facility(69) &&
> + test_facility(70) && test_facility(71) && test_facility(72))
> + allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ZPCI_INTERP);
> +
> if (MACHINE_HAS_ESOP)
> allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ESOP);
> /*
>

--
Pierre Morel
IBM Lab Boeblingen
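
For reference, a rough sketch of how userspace might check for the new CPU
feature via the existing KVM_S390_VM_CPU_MODEL device attributes; this is
not from the series, and the MSB-first bit test (mirroring the kernel's
set_bit_inv convention for this bitmap) should be verified against the
uapi headers:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>	/* pulls in asm/kvm.h: kvm_s390_vm_cpu_feat, etc. */

static int host_has_zpci_interp(int vm_fd)
{
	struct kvm_s390_vm_cpu_feat feat;
	struct kvm_device_attr attr = {
		.group = KVM_S390_VM_CPU_MODEL,
		.attr  = KVM_S390_VM_CPU_MACHINE_FEAT,
		.addr  = (__u64)(unsigned long)&feat,
	};

	memset(&feat, 0, sizeof(feat));
	if (ioctl(vm_fd, KVM_GET_DEVICE_ATTR, &attr))
		return 0;

	/* Feature 14 lives in feat[0]; the kernel fills this bitmap
	 * MSB-first, hence the 63 - n shift (verify against headers). */
	return !!(feat.feat[0] &
		  (1ULL << (63 - KVM_S390_VM_CPU_FEAT_ZPCI_INTERP)));
}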

2022-02-09 04:08:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, Feb 08, 2022 at 12:26:24PM -0700, Alex Williamson wrote:

> > Personally, I think it is wrong layering for VFIO to be aware of KVM
> > like this. This marks the first time that VFIO core code itself is
> > being made aware of the KVM linkage.
>
> I agree, but I've resigned that I've lost that battle. Both mdev vGPU
> vendors make specific assumptions about running on a VM.

The vGPUs are not as egregious though, are they?

> > Or, at the very least, everything needs to be described in some way
> > that makes it clear what is happening to userspace, without kvm,
> > through these ioctls.
>
> As I understand the discussion here:
>
> https://lore.kernel.org/all/[email protected]/
>
> The assumption is that there is no non-KVM userspace currently. This
> seems like a regression to me.

Indeed, I definitely don't like it either. This is not VFIO if it is
just driving KVM.

I would prefer they add a function to get the 'struct device *' from a
VFIO device fd and drive more of this from kvm, as appropriate.

> > > this is meant to extend vfio-pci proper for the whole arch. Is there a
> > > compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
> > > code that implements these arch specific hooks and the core for
> > > everything else? SPAPR code could probably converted similarly, it
> > > exists here for legacy reasons. [Cc Jason]
> >
> > I'm not sure I get what you are suggesting? Where would these ifdefs
> > be?
>
> Essentially just:
>
> static const struct vfio_device_ops vfio_pci_ops = {
> .name = "vfio-pci",
> #ifdef CONFIG_S390
> .open_device = vfio_zpci_open_device,
> .close_device = vfio_zpci_close_device,
> .ioctl = vfio_zpci_ioctl,
> #else
> .open_device = vfio_pci_open_device,
> .close_device = vfio_pci_core_close_device,
> .ioctl = vfio_pci_core_ioctl,
> #endif
> .read = vfio_pci_core_read,
> .write = vfio_pci_core_write,
> .mmap = vfio_pci_core_mmap,
> .request = vfio_pci_core_request,
> .match = vfio_pci_core_match,
> };
>
> It would at least provide more validation/exercise of the core/vendor
> split. Thanks,

This would have to be in every pci driver - this is not just code the
universal vfio-pci has to enable, but every migration driver/etc too.

And we will need it again in vfio-cxl for s390 in 10 years too.

So, I think this approach is the right one, aside from the
philosophical question of so tightly linking s390 vfio to KVM.

Jason

2022-02-09 06:21:51

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v3 06/30] s390/airq: allow for airq structure that uses an input vector

On Fri, 4 Feb 2022 16:15:12 -0500
Matthew Rosato <[email protected]> wrote:

> When doing device passthrough where interrupts are being forwarded from
> host to guest, we wish to use a pinned section of guest memory as the
> vector (the same memory used by the guest as the vector). To accomplish
> this, add a new parameter for airq_iv_create which allows passing an
> existing vector to be used instead of allocating a new one. The caller
> is responsible for ensuring the vector is pinned in memory as well as for
> unpinning the memory when the vector is no longer needed.
>
> A subsequent patch will use this new parameter for zPCI interpretation.
>
> Reviewed-by: Pierre Morel <[email protected]>
> Signed-off-by: Matthew Rosato <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/include/asm/airq.h | 4 +++-
> arch/s390/pci/pci_irq.c | 8 ++++----
> drivers/s390/cio/airq.c | 10 +++++++---
> drivers/s390/virtio/virtio_ccw.c | 2 +-
> 4 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/arch/s390/include/asm/airq.h b/arch/s390/include/asm/airq.h
> index 7918a7d09028..e82e5626e139 100644
> --- a/arch/s390/include/asm/airq.h
> +++ b/arch/s390/include/asm/airq.h
> @@ -47,8 +47,10 @@ struct airq_iv {
> #define AIRQ_IV_PTR 4 /* Allocate the ptr array */
> #define AIRQ_IV_DATA 8 /* Allocate the data array */
> #define AIRQ_IV_CACHELINE 16 /* Cacheline alignment for the vector */
> +#define AIRQ_IV_GUESTVEC 32 /* Vector is a pinned guest page */
>
> -struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags);
> +struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags,
> + unsigned long *vec);
> void airq_iv_release(struct airq_iv *iv);
> unsigned long airq_iv_alloc(struct airq_iv *iv, unsigned long num);
> void airq_iv_free(struct airq_iv *iv, unsigned long bit, unsigned long num);
> diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
> index cc4c8d7c8f5c..0d0a02a9fbbf 100644
> --- a/arch/s390/pci/pci_irq.c
> +++ b/arch/s390/pci/pci_irq.c
> @@ -296,7 +296,7 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> zdev->aisb = bit;
>
> /* Create adapter interrupt vector */
> - zdev->aibv = airq_iv_create(msi_vecs, AIRQ_IV_DATA | AIRQ_IV_BITLOCK);
> + zdev->aibv = airq_iv_create(msi_vecs, AIRQ_IV_DATA | AIRQ_IV_BITLOCK, NULL);
> if (!zdev->aibv)
> return -ENOMEM;
>
> @@ -419,7 +419,7 @@ static int __init zpci_directed_irq_init(void)
> union zpci_sic_iib iib = {{0}};
> unsigned int cpu;
>
> - zpci_sbv = airq_iv_create(num_possible_cpus(), 0);
> + zpci_sbv = airq_iv_create(num_possible_cpus(), 0, NULL);
> if (!zpci_sbv)
> return -ENOMEM;
>
> @@ -441,7 +441,7 @@ static int __init zpci_directed_irq_init(void)
> zpci_ibv[cpu] = airq_iv_create(cache_line_size() * BITS_PER_BYTE,
> AIRQ_IV_DATA |
> AIRQ_IV_CACHELINE |
> - (!cpu ? AIRQ_IV_ALLOC : 0));
> + (!cpu ? AIRQ_IV_ALLOC : 0), NULL);
> if (!zpci_ibv[cpu])
> return -ENOMEM;
> }
> @@ -458,7 +458,7 @@ static int __init zpci_floating_irq_init(void)
> if (!zpci_ibv)
> return -ENOMEM;
>
> - zpci_sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC);
> + zpci_sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC, NULL);
> if (!zpci_sbv)
> goto out_free;
>
> diff --git a/drivers/s390/cio/airq.c b/drivers/s390/cio/airq.c
> index 2f2226786319..375a58b1c838 100644
> --- a/drivers/s390/cio/airq.c
> +++ b/drivers/s390/cio/airq.c
> @@ -122,10 +122,12 @@ static inline unsigned long iv_size(unsigned long bits)
> * airq_iv_create - create an interrupt vector
> * @bits: number of bits in the interrupt vector
> * @flags: allocation flags
> + * @vec: pointer to pinned guest memory if AIRQ_IV_GUESTVEC
> *
> * Returns a pointer to an interrupt vector structure
> */
> -struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
> +struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags,
> + unsigned long *vec)
> {
> struct airq_iv *iv;
> unsigned long size;
> @@ -146,6 +148,8 @@ struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
> &iv->vector_dma);
> if (!iv->vector)
> goto out_free;
> + } else if (flags & AIRQ_IV_GUESTVEC) {
> + iv->vector = vec;
> } else {
> iv->vector = cio_dma_zalloc(size);
> if (!iv->vector)
> @@ -185,7 +189,7 @@ struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
> kfree(iv->avail);
> if (iv->flags & AIRQ_IV_CACHELINE && iv->vector)
> dma_pool_free(airq_iv_cache, iv->vector, iv->vector_dma);
> - else
> + else if (!(iv->flags & AIRQ_IV_GUESTVEC))
> cio_dma_free(iv->vector, size);
> kfree(iv);
> out:
> @@ -204,7 +208,7 @@ void airq_iv_release(struct airq_iv *iv)
> kfree(iv->bitlock);
> if (iv->flags & AIRQ_IV_CACHELINE)
> dma_pool_free(airq_iv_cache, iv->vector, iv->vector_dma);
> - else
> + else if (!(iv->flags & AIRQ_IV_GUESTVEC))
> cio_dma_free(iv->vector, iv_size(iv->bits));
> kfree(iv->avail);
> kfree(iv);
> diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
> index 52c376d15978..410498d693f8 100644
> --- a/drivers/s390/virtio/virtio_ccw.c
> +++ b/drivers/s390/virtio/virtio_ccw.c
> @@ -241,7 +241,7 @@ static struct airq_info *new_airq_info(int index)
> return NULL;
> rwlock_init(&info->lock);
> info->aiv = airq_iv_create(VIRTIO_IV_BITS, AIRQ_IV_ALLOC | AIRQ_IV_PTR
> - | AIRQ_IV_CACHELINE);
> + | AIRQ_IV_CACHELINE, NULL);
> if (!info->aiv) {
> kfree(info);
> return NULL;


2022-02-09 06:41:28

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Fri, 4 Feb 2022 16:15:30 -0500
Matthew Rosato <[email protected]> wrote:

> KVM zPCI passthrough device logic will need a reference to the associated
> kvm guest that has access to the device. Let's register a group notifier
> for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
> an association between a kvm guest and the host zdev.
>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> arch/s390/include/asm/kvm_pci.h | 2 ++
> drivers/vfio/pci/vfio_pci_core.c | 2 ++
> drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
> include/linux/vfio_pci_core.h | 10 +++++++
> 4 files changed, 60 insertions(+)
>
> diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
> index e4696f5592e1..16290b4cf2a6 100644
> --- a/arch/s390/include/asm/kvm_pci.h
> +++ b/arch/s390/include/asm/kvm_pci.h
> @@ -16,6 +16,7 @@
> #include <linux/kvm.h>
> #include <linux/pci.h>
> #include <linux/mutex.h>
> +#include <linux/notifier.h>
> #include <asm/pci_insn.h>
> #include <asm/pci_dma.h>
>
> @@ -32,6 +33,7 @@ struct kvm_zdev {
> u64 rpcit_count;
> struct kvm_zdev_ioat ioat;
> struct zpci_fib fib;
> + struct notifier_block nb;
> };
>
> int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f948e6cd2993..fc57d4d0abbe 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>
> vfio_pci_vf_token_user_add(vdev, -1);
> vfio_spapr_pci_eeh_release(vdev->pdev);
> + vfio_pci_zdev_release(vdev);
> vfio_pci_core_disable(vdev);
>
> mutex_lock(&vdev->igate);
> @@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
> {
> vfio_pci_probe_mmaps(vdev);
> + vfio_pci_zdev_open(vdev);
> vfio_spapr_pci_eeh_open(vdev->pdev);
> vfio_pci_vf_token_user_add(vdev, 1);
> }

If this handling were for a specific device, I think we'd be suggesting
this is the point at which we cross over to a vendor variant making use
of vfio-pci-core rather than hooking directly into the core code. But
this is meant to extend vfio-pci proper for the whole arch. Is there a
compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
code that implements these arch specific hooks and the core for
everything else? SPAPR code could probably be converted similarly; it
exists here for legacy reasons. [Cc Jason]

Also, please note the DEVICE_FEATURE generalizations in the latest
series from NVIDIA for mlx5 migration support:

https://lore.kernel.org/all/[email protected]/

If this series were to go in via the s390 tree, I'd request a branch so
that we can continue to work on this in vfio code as well. Thanks,

Alex

> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index ea4c0d2b0663..9f8284499111 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -13,6 +13,7 @@
> #include <linux/vfio_zdev.h>
> #include <asm/pci_clp.h>
> #include <asm/pci_io.h>
> +#include <asm/kvm_pci.h>
>
> #include <linux/vfio_pci_core.h>
>
> @@ -136,3 +137,48 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>
> return ret;
> }
> +
> +static int vfio_pci_zdev_group_notifier(struct notifier_block *nb,
> + unsigned long action, void *data)
> +{
> + struct kvm_zdev *kzdev = container_of(nb, struct kvm_zdev, nb);
> +
> + if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
> + if (!data || !kzdev->zdev)
> + return NOTIFY_DONE;
> + kzdev->kvm = data;
> + }
> +
> + return NOTIFY_OK;
> +}
> +
> +void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
> +{
> + unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
> + struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +
> + if (!zdev)
> + return;
> +
> + if (kvm_s390_pci_dev_open(zdev))
> + return;
> +
> + zdev->kzdev->nb.notifier_call = vfio_pci_zdev_group_notifier;
> +
> + if (vfio_register_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
> + &events, &zdev->kzdev->nb))
> + kvm_s390_pci_dev_release(zdev);
> +}
> +
> +void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
> +{
> + struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +
> + if (!zdev || !zdev->kzdev)
> + return;
> +
> + vfio_unregister_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
> + &zdev->kzdev->nb);
> +
> + kvm_s390_pci_dev_release(zdev);
> +}
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 5e2bca3b89db..05287f8ac855 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -198,12 +198,22 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
> #ifdef CONFIG_VFIO_PCI_ZDEV
> extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
> struct vfio_info_cap *caps);
> +void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev);
> +void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev);
> #else
> static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
> struct vfio_info_cap *caps)
> {
> return -ENODEV;
> }
> +
> +static inline void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
> +{
> +}
> +
> +static inline void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
> +{
> +}
> #endif
>
> /* Will be exported for vfio pci drivers usage */


2022-02-09 07:11:22

by Matthew Rosato

[permalink] [raw]
Subject: [PATCH v3 25/30] vfio-pci/zdev: wire up zPCI interpretive execution support

Introduce support for VFIO_DEVICE_FEATURE_ZPCI_INTERP, a new device
feature for the VFIO_DEVICE_FEATURE ioctl. This interface is used to
request that an s390x vfio-pci device enable/disable zPCI interpretive
execution, which allows zPCI instructions to be executed directly by
the underlying firmware without KVM involvement.

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/kvm_pci.h | 1 +
drivers/vfio/pci/vfio_pci_core.c | 2 +
drivers/vfio/pci/vfio_pci_zdev.c | 73 ++++++++++++++++++++++++++++++++
include/linux/vfio_pci_core.h | 10 +++++
include/uapi/linux/vfio.h | 7 +++
include/uapi/linux/vfio_zdev.h | 15 +++++++
6 files changed, 108 insertions(+)

diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
index 16290b4cf2a6..8f7d371e3e59 100644
--- a/arch/s390/include/asm/kvm_pci.h
+++ b/arch/s390/include/asm/kvm_pci.h
@@ -34,6 +34,7 @@ struct kvm_zdev {
struct kvm_zdev_ioat ioat;
struct zpci_fib fib;
struct notifier_block nb;
+ bool interpretation;
};

int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index fc57d4d0abbe..2b2d64a2190c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1172,6 +1172,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
mutex_unlock(&vdev->vf_token->lock);

return 0;
+ case VFIO_DEVICE_FEATURE_ZPCI_INTERP:
+ return vfio_pci_zdev_feat_interp(vdev, feature, arg);
default:
return -ENOTTY;
}
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 9f8284499111..71cc1eb8085f 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -54,6 +54,10 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
.version = zdev->version
};

+ /* Some values are different for interpreted devices */
+ if (zdev->kzdev && zdev->kzdev->interpretation)
+ cap.maxstbl = zdev->maxstbl;
+
return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
}

@@ -138,6 +142,67 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
return ret;
}

+int vfio_pci_zdev_feat_interp(struct vfio_pci_core_device *vdev,
+ struct vfio_device_feature feature,
+ unsigned long arg)
+{
+ struct zpci_dev *zdev = to_zpci(vdev->pdev);
+ struct vfio_device_zpci_interp *data;
+ struct vfio_device_feature *feat;
+ unsigned long minsz;
+ int size, rc;
+
+ if (!zdev || !zdev->kzdev)
+ return -EINVAL;
+
+ /* If PROBE specified, return probe results immediately */
+ if (feature.flags & VFIO_DEVICE_FEATURE_PROBE)
+ return kvm_s390_pci_interp_probe(zdev);
+
+ size = sizeof(*feat) + sizeof(*data);
+ feat = kzalloc(size, GFP_KERNEL);
+ if (!feat)
+ return -ENOMEM;
+
+ data = (struct vfio_device_zpci_interp *)&feat->data;
+ minsz = offsetofend(struct vfio_device_feature, flags);
+
+ if (feature.argsz < minsz + sizeof(*data))
+ return -EINVAL;
+
+ /* Get the rest of the payload for GET/SET */
+ rc = copy_from_user(data, (void __user *)(arg + minsz),
+ sizeof(*data));
+ if (rc)
+ rc = -EINVAL;
+
+ if (feature.flags & VFIO_DEVICE_FEATURE_GET) {
+ if (zdev->gisa != 0)
+ data->flags = VFIO_DEVICE_ZPCI_FLAG_INTERP;
+ else
+ data->flags = 0;
+ data->fh = zdev->fh;
+ /* userspace is using host fh, give interpreted clp values */
+ zdev->kzdev->interpretation = true;
+
+ if (copy_to_user((void __user *)arg, feat, size))
+ rc = -EFAULT;
+ } else if (feature.flags & VFIO_DEVICE_FEATURE_SET) {
+ if (data->flags == VFIO_DEVICE_ZPCI_FLAG_INTERP)
+ rc = kvm_s390_pci_interp_enable(zdev);
+ else if (data->flags == 0)
+ rc = kvm_s390_pci_interp_disable(zdev, false);
+ else
+ rc = -EINVAL;
+ } else {
+ /* Neither GET nor SET were specified */
+ rc = -EINVAL;
+ }
+
+ kfree(feat);
+ return rc;
+}
+
static int vfio_pci_zdev_group_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
@@ -164,6 +229,7 @@ void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
return;

zdev->kzdev->nb.notifier_call = vfio_pci_zdev_group_notifier;
+ zdev->kzdev->interpretation = false;

if (vfio_register_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
&events, &zdev->kzdev->nb))
@@ -180,5 +246,12 @@ void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
vfio_unregister_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
&zdev->kzdev->nb);

+ /*
+ * If the device was using interpretation, don't trust that userspace
+ * did the appropriate cleanup
+ */
+ if (zdev->gisa != 0)
+ kvm_s390_pci_interp_disable(zdev, true);
+
kvm_s390_pci_dev_release(zdev);
}
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 05287f8ac855..0db2b1051931 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -198,6 +198,9 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
#ifdef CONFIG_VFIO_PCI_ZDEV
extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
struct vfio_info_cap *caps);
+int vfio_pci_zdev_feat_interp(struct vfio_pci_core_device *vdev,
+ struct vfio_device_feature feature,
+ unsigned long arg);
void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev);
void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev);
#else
@@ -207,6 +210,13 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
return -ENODEV;
}

+static inline int vfio_pci_zdev_feat_interp(struct vfio_pci_core_device *vdev,
+ struct vfio_device_feature feature,
+ unsigned long arg)
+{
+ return -ENOTTY;
+}
+
static inline void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
{
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..b9a75485b8e7 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1002,6 +1002,13 @@ struct vfio_device_feature {
*/
#define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN (0)

+/*
+ * Provide support for enabling interpretation of zPCI instructions. This
+ * feature is only valid for s390x PCI devices. Data provided when setting
+ * and getting this feature is further described in vfio_zdev.h
+ */
+#define VFIO_DEVICE_FEATURE_ZPCI_INTERP (1)
+
/* -------- API for Type1 VFIO IOMMU -------- */

/**
diff --git a/include/uapi/linux/vfio_zdev.h b/include/uapi/linux/vfio_zdev.h
index b4309397b6b2..575f0410dc66 100644
--- a/include/uapi/linux/vfio_zdev.h
+++ b/include/uapi/linux/vfio_zdev.h
@@ -75,4 +75,19 @@ struct vfio_device_info_cap_zpci_pfip {
__u8 pfip[];
};

+/**
+ * VFIO_DEVICE_FEATURE_ZPCI_INTERP
+ *
+ * This feature is used for enabling zPCI instruction interpretation for a
+ * device. No data is provided when setting this feature. When getting
+ * this feature, the following structure is provided which details whether
+ * or not interpretation is active and provides the guest with host device
+ * information necessary to enable interpretation.
+ */
+struct vfio_device_zpci_interp {
+ __u64 flags;
+#define VFIO_DEVICE_ZPCI_FLAG_INTERP 1
+ __u32 fh; /* Host device function handle */
+};
+
#endif
--
2.27.0
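
As a rough illustration (not part of the patch), a QEMU-like userspace
might enable interpretation on an opened vfio-pci device fd along these
lines; error handling is trimmed and the layout simply follows the uapi
structures added above:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <linux/vfio_zdev.h>

static int zpci_interp_enable(int device_fd)
{
	/* __u64 buffer keeps the trailing payload 8-byte aligned */
	__u64 buf[(sizeof(struct vfio_device_feature) +
		   sizeof(struct vfio_device_zpci_interp) + 7) / 8];
	struct vfio_device_feature *feat = (struct vfio_device_feature *)buf;
	struct vfio_device_zpci_interp *data =
		(struct vfio_device_zpci_interp *)feat->data;

	memset(buf, 0, sizeof(buf));
	feat->argsz = sizeof(struct vfio_device_feature) +
		      sizeof(struct vfio_device_zpci_interp);
	feat->flags = VFIO_DEVICE_FEATURE_SET |
		      VFIO_DEVICE_FEATURE_ZPCI_INTERP;
	data->flags = VFIO_DEVICE_ZPCI_FLAG_INTERP;	/* enable interpretation */

	return ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
}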


2022-02-09 09:55:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, Feb 08, 2022 at 10:43:19AM -0700, Alex Williamson wrote:
> On Fri, 4 Feb 2022 16:15:30 -0500
> Matthew Rosato <[email protected]> wrote:
>
> > KVM zPCI passthrough device logic will need a reference to the associated
> > kvm guest that has access to the device. Let's register a group notifier
> > for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
> > an association between a kvm guest and the host zdev.
> >
> > Signed-off-by: Matthew Rosato <[email protected]>
> > arch/s390/include/asm/kvm_pci.h | 2 ++
> > drivers/vfio/pci/vfio_pci_core.c | 2 ++
> > drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
> > include/linux/vfio_pci_core.h | 10 +++++++
> > 4 files changed, 60 insertions(+)
> >
> > diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
> > index e4696f5592e1..16290b4cf2a6 100644
> > +++ b/arch/s390/include/asm/kvm_pci.h
> > @@ -16,6 +16,7 @@
> > #include <linux/kvm.h>
> > #include <linux/pci.h>
> > #include <linux/mutex.h>
> > +#include <linux/notifier.h>
> > #include <asm/pci_insn.h>
> > #include <asm/pci_dma.h>
> >
> > @@ -32,6 +33,7 @@ struct kvm_zdev {
> > u64 rpcit_count;
> > struct kvm_zdev_ioat ioat;
> > struct zpci_fib fib;
> > + struct notifier_block nb;
> > };
> >
> > int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index f948e6cd2993..fc57d4d0abbe 100644
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
> >
> > vfio_pci_vf_token_user_add(vdev, -1);
> > vfio_spapr_pci_eeh_release(vdev->pdev);
> > + vfio_pci_zdev_release(vdev);
> > vfio_pci_core_disable(vdev);
> >
> > mutex_lock(&vdev->igate);
> > @@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> > void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
> > {
> > vfio_pci_probe_mmaps(vdev);
> > + vfio_pci_zdev_open(vdev);
> > vfio_spapr_pci_eeh_open(vdev->pdev);
> > vfio_pci_vf_token_user_add(vdev, 1);
> > }
>
> If this handling were for a specific device, I think we'd be suggesting
> this is the point at which we cross over to a vendor variant making use
> of vfio-pci-core rather than hooking directly into the core code.

Personally, I think it is wrong layering for VFIO to be aware of KVM
like this. This marks the first time that VFIO core code itself is
being made aware of the KVM linkage.

It copies the same kind of design the s390 specific mdev use of
putting VFIO in charge of KVM functionality. If we are doing this we
should just give up and admit that KVM is a first-class part of struct
vfio_device and get rid of the notifier stuff too, at least for s390.

Reading the patches and descriptions pretty much everything is boiling
down to 'use vfio to tell the kvm architecture code to do something' -
which I think needs to be handled through a KVM side ioctl.

Or, at the very least, everything needs to be described in some way
that makes it clear what is happening to userspace, without kvm,
through these ioctls.

This seems especially true now that it seems s390 PCI support is
almost truly functional, with actual new userspace instructions to
issue MMIO operations that work outside of KVM.

I'm not sure how this all fits together, but I would expect an outcome
where DPDK could run on these new systems and not have to know
anything more about s390 beyond using the proper MMIO instructions via
some compilation time enablement.

(I've been reviewing s390 patches updating rdma for a parallel set of
stuff)

> this is meant to extend vfio-pci proper for the whole arch. Is there a
> compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
> code that implements these arch specific hooks and the core for
> everything else? SPAPR code could probably converted similarly, it
> exists here for legacy reasons. [Cc Jason]

I'm not sure I get what you are suggesting? Where would these ifdefs
be?

> Also, please note the DEVICE_FEATURE generalizations in the latest
> series from NVIDIA for mlx5 migration support:

> https://lore.kernel.org/all/[email protected]/

Yes, please don't implement a bunch of new FEATURE code without taking
the cleanup patches for feature support from that series too.

I can put them on a branch for you if needed.

Jason

2022-02-09 10:11:13

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On 2/8/22 2:26 PM, Alex Williamson wrote:
> On Tue, 8 Feb 2022 14:51:41 -0400
> Jason Gunthorpe <[email protected]> wrote:
>
>> On Tue, Feb 08, 2022 at 10:43:19AM -0700, Alex Williamson wrote:
>>> On Fri, 4 Feb 2022 16:15:30 -0500
>>> Matthew Rosato <[email protected]> wrote:
>>>
>>>> KVM zPCI passthrough device logic will need a reference to the associated
>>>> kvm guest that has access to the device. Let's register a group notifier
>>>> for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
>>>> an association between a kvm guest and the host zdev.
>>>>
>>>> Signed-off-by: Matthew Rosato <[email protected]>
>>>> arch/s390/include/asm/kvm_pci.h | 2 ++
>>>> drivers/vfio/pci/vfio_pci_core.c | 2 ++
>>>> drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
>>>> include/linux/vfio_pci_core.h | 10 +++++++
>>>> 4 files changed, 60 insertions(+)
>>>>
>>>> diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
>>>> index e4696f5592e1..16290b4cf2a6 100644
>>>> +++ b/arch/s390/include/asm/kvm_pci.h
>>>> @@ -16,6 +16,7 @@
>>>> #include <linux/kvm.h>
>>>> #include <linux/pci.h>
>>>> #include <linux/mutex.h>
>>>> +#include <linux/notifier.h>
>>>> #include <asm/pci_insn.h>
>>>> #include <asm/pci_dma.h>
>>>>
>>>> @@ -32,6 +33,7 @@ struct kvm_zdev {
>>>> u64 rpcit_count;
>>>> struct kvm_zdev_ioat ioat;
>>>> struct zpci_fib fib;
>>>> + struct notifier_block nb;
>>>> };
>>>>
>>>> int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index f948e6cd2993..fc57d4d0abbe 100644
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>>>>
>>>> vfio_pci_vf_token_user_add(vdev, -1);
>>>> vfio_spapr_pci_eeh_release(vdev->pdev);
>>>> + vfio_pci_zdev_release(vdev);
>>>> vfio_pci_core_disable(vdev);
>>>>
>>>> mutex_lock(&vdev->igate);
>>>> @@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
>>>> void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
>>>> {
>>>> vfio_pci_probe_mmaps(vdev);
>>>> + vfio_pci_zdev_open(vdev);
>>>> vfio_spapr_pci_eeh_open(vdev->pdev);
>>>> vfio_pci_vf_token_user_add(vdev, 1);
>>>> }
>>>
>>> If this handling were for a specific device, I think we'd be suggesting
>>> this is the point at which we cross over to a vendor variant making use
>>> of vfio-pci-core rather than hooking directly into the core code.
>>
>> Personally, I think it is wrong layering for VFIO to be aware of KVM
>> like this. This marks the first time that VFIO core code itself is
>> being made aware of the KVM linkage.
>
> I agree, but I've resigned that I've lost that battle. Both mdev vGPU
> vendors make specific assumptions about running on a VM. VFIO was
> never intended to be tied to KVM or the specific use case of a VM.
>
>> It copies the same kind of design the s390 specific mdev use of
>> putting VFIO in charge of KVM functionality. If we are doing this we
>> should just give up and admit that KVM is a first-class part of struct
>> vfio_device and get rid of the notifier stuff too, at least for s390.
>
> Euw. You're right, I really don't like vfio core code embracing this
> dependency for s390, device specific use cases are bad enough.
>
>> Reading the patches and descriptions pretty much everything is boiling
>> down to 'use vfio to tell the kvm architecture code to do something' -
>> which I think needs to be handled through a KVM side ioctl.
>
> AIF at least sounds a lot like the reason we invented the irq bypass
> mechanism to allow interrupt producers and consumers to register
> independently and associate to each other with a shared token.

Yes, these do sound quite similar; looking at it now, though, I haven't
yet fully grokked irq bypass... But with AIF you have the case where
either the interrupt is delivered directly to the guest by firmware
via an s390 construct (the gisa), or, under various circumstances, the
host (kvm) is prodded to perform the delivery (still via the gisa) instead.

>
> Is the purpose of IOAT to associate the device to a set of KVM page
> tables? That seems like a container or future iommufd operation. I

Yes, here we are establishing a relationship with the DMA table in the
guest so that once mappings are established guest PCI operations
(handled via special instructions in s390) don't need to go through the
host but can be directly handled by firmware (so, effectively guest can
keep running on its vcpu vs breaking out).

> read DTSM as supported formats for the IOAT.
>
>> Or, at the very least, everything needs to be described in some way
>> that makes it clear what is happening to userspace, without kvm,
>> through these ioctls.

Nothing, they don't need these ioctls. Userspace without a KVM
registration for the device in question gets -EINVAL.

>
> As I understand the discussion here:
>
> https://lore.kernel.org/all/[email protected]/
>
> The assumption is that there is no non-KVM userspace currently. This
> seems like a regression to me.

It's more that non-KVM userspace doesn't care about what these ioctls
are doing... The enabling of 'interp, aif, ioat' is only pertinent when
there is a KVM userspace, specifically because the information being
shared / actions being performed as a result are only relevant to
properly enabling zPCI features when the zPCI device is being passed
through to a VM guest. If you're just using a userspace driver to talk
to the device (no KVM guest involved) then the kernel zPCI layer already
has this device set up using whatever s390 facilities are available.

>
>> This seems especially true now that it seems s390 PCI support is
>> almost truely functional, with actual new userspace instructions to
>> issue MMIO operations that work outside of KVM.
>>
>> I'm not sure how this all fits together, but I would expect an outcome
>> where DPDK could run on these new systems and not have to know
>> anything more about s390 beyond using the proper MMIO instructions via
>> some compilation time enablement.
>
> Yes, fully enabling zPCI with vfio, but only for KVM is not optimal.

See above. I think there is a misunderstanding here: it's not that we
are only enabling zPCI with vfio for KVM, but rather that when using
vfio to pass the device to a guest there is additional work that has to
happen in order to 'fully enable' zPCI.

>
>> (I've been reviewing s390 patches updating rdma for a parallel set of
>> stuff)
>>
>>> this is meant to extend vfio-pci proper for the whole arch. Is there a
>>> compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
>>> code that implements these arch specific hooks and the core for
>>> everything else? SPAPR code could probably converted similarly, it
>>> exists here for legacy reasons. [Cc Jason]
>>
>> I'm not sure I get what you are suggesting? Where would these ifdefs
>> be?
>
> Essentially just:
>
> static const struct vfio_device_ops vfio_pci_ops = {
> .name = "vfio-pci",
> #ifdef CONFIG_S390
> .open_device = vfio_zpci_open_device,
> .close_device = vfio_zpci_close_device,
> .ioctl = vfio_zpci_ioctl,
> #else
> .open_device = vfio_pci_open_device,
> .close_device = vfio_pci_core_close_device,
> .ioctl = vfio_pci_core_ioctl,
> #endif
> .read = vfio_pci_core_read,
> .write = vfio_pci_core_write,
> .mmap = vfio_pci_core_mmap,
> .request = vfio_pci_core_request,
> .match = vfio_pci_core_match,
> };
>
> It would at least provide more validation/exercise of the core/vendor
> split. Thanks,
>
> Alex
>


2022-02-09 12:22:40

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, 8 Feb 2022 14:51:41 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Tue, Feb 08, 2022 at 10:43:19AM -0700, Alex Williamson wrote:
> > On Fri, 4 Feb 2022 16:15:30 -0500
> > Matthew Rosato <[email protected]> wrote:
> >
> > > KVM zPCI passthrough device logic will need a reference to the associated
> > > kvm guest that has access to the device. Let's register a group notifier
> > > for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
> > > an association between a kvm guest and the host zdev.
> > >
> > > Signed-off-by: Matthew Rosato <[email protected]>
> > > arch/s390/include/asm/kvm_pci.h | 2 ++
> > > drivers/vfio/pci/vfio_pci_core.c | 2 ++
> > > drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
> > > include/linux/vfio_pci_core.h | 10 +++++++
> > > 4 files changed, 60 insertions(+)
> > >
> > > diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
> > > index e4696f5592e1..16290b4cf2a6 100644
> > > +++ b/arch/s390/include/asm/kvm_pci.h
> > > @@ -16,6 +16,7 @@
> > > #include <linux/kvm.h>
> > > #include <linux/pci.h>
> > > #include <linux/mutex.h>
> > > +#include <linux/notifier.h>
> > > #include <asm/pci_insn.h>
> > > #include <asm/pci_dma.h>
> > >
> > > @@ -32,6 +33,7 @@ struct kvm_zdev {
> > > u64 rpcit_count;
> > > struct kvm_zdev_ioat ioat;
> > > struct zpci_fib fib;
> > > + struct notifier_block nb;
> > > };
> > >
> > > int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
> > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > > index f948e6cd2993..fc57d4d0abbe 100644
> > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > @@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
> > >
> > > vfio_pci_vf_token_user_add(vdev, -1);
> > > vfio_spapr_pci_eeh_release(vdev->pdev);
> > > + vfio_pci_zdev_release(vdev);
> > > vfio_pci_core_disable(vdev);
> > >
> > > mutex_lock(&vdev->igate);
> > > @@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> > > void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
> > > {
> > > vfio_pci_probe_mmaps(vdev);
> > > + vfio_pci_zdev_open(vdev);
> > > vfio_spapr_pci_eeh_open(vdev->pdev);
> > > vfio_pci_vf_token_user_add(vdev, 1);
> > > }
> >
> > If this handling were for a specific device, I think we'd be suggesting
> > this is the point at which we cross over to a vendor variant making use
> > of vfio-pci-core rather than hooking directly into the core code.
>
> Personally, I think it is wrong layering for VFIO to be aware of KVM
> like this. This marks the first time that VFIO core code itself is
> being made aware of the KVM linkage.

I agree, but I've resigned that I've lost that battle. Both mdev vGPU
vendors make specific assumptions about running on a VM. VFIO was
never intended to be tied to KVM or the specific use case of a VM.

> It copies the same kind of design the s390 specific mdev use of
> putting VFIO in charge of KVM functionality. If we are doing this we
> should just give up and admit that KVM is a first-class part of struct
> vfio_device and get rid of the notifier stuff too, at least for s390.

Euw. You're right, I really don't like vfio core code embracing this
dependency for s390, device specific use cases are bad enough.

> Reading the patches and descriptions pretty much everything is boiling
> down to 'use vfio to tell the kvm architecture code to do something' -
> which I think needs to be handled through a KVM side ioctl.

AIF at least sounds a lot like the reason we invented the irq bypass
mechanism to allow interrupt producers and consumers to register
independently and associate to each other with a shared token.

Is the purpose of IOAT to associate the device to a set of KVM page
tables? That seems like a container or future iommufd operation. I
read DTSM as supported formats for the IOAT.

> Or, at the very least, everything needs to be described in some way
> that makes it clear what is happening to userspace, without kvm,
> through these ioctls.

As I understand the discussion here:

https://lore.kernel.org/all/[email protected]/

The assumption is that there is no non-KVM userspace currently. This
seems like a regression to me.

> This seems especially true now that it seems s390 PCI support is
> almost truely functional, with actual new userspace instructions to
> issue MMIO operations that work outside of KVM.
>
> I'm not sure how this all fits together, but I would expect an outcome
> where DPDK could run on these new systems and not have to know
> anything more about s390 beyond using the proper MMIO instructions via
> some compilation time enablement.

Yes, fully enabling zPCI with vfio, but only for KVM is not optimal.

> (I've been reviewing s390 patches updating rdma for a parallel set of
> stuff)
>
> > this is meant to extend vfio-pci proper for the whole arch. Is there a
> > compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
> > code that implements these arch specific hooks and the core for
> > everything else? SPAPR code could probably converted similarly, it
> > exists here for legacy reasons. [Cc Jason]
>
> I'm not sure I get what you are suggesting? Where would these ifdefs
> be?

Essentially just:

static const struct vfio_device_ops vfio_pci_ops = {
.name = "vfio-pci",
#ifdef CONFIG_S390
.open_device = vfio_zpci_open_device,
.close_device = vfio_zpci_close_device,
.ioctl = vfio_zpci_ioctl,
#else
.open_device = vfio_pci_open_device,
.close_device = vfio_pci_core_close_device,
.ioctl = vfio_pci_core_ioctl,
#endif
.read = vfio_pci_core_read,
.write = vfio_pci_core_write,
.mmap = vfio_pci_core_mmap,
.request = vfio_pci_core_request,
.match = vfio_pci_core_match,
};

It would at least provide more validation/exercise of the core/vendor
split. Thanks,

Alex


2022-02-09 12:23:23

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On 2/8/22 3:40 PM, Jason Gunthorpe wrote:
> On Tue, Feb 08, 2022 at 03:33:58PM -0500, Matthew Rosato wrote:
>
>>> Is the purpose of IOAT to associate the device to a set of KVM page
>>> tables? That seems like a container or future iommufd operation. I
>>
>> Yes, here we are establishing a relationship with the DMA table in the guest
>> so that once mappings are established guest PCI operations (handled via
>> special instructions in s390) don't need to go through the host but can be
>> directly handled by firmware (so, effectively guest can keep running on its
>> vcpu vs breaking out).
>
> Oh, well, certainly sounds like a NAK on that - anything to do with
> the DMA translation of a PCI device must go through the iommu layer,
> not here.
>
> Lets not repeat the iommu subsytem bypass mess power made please.
>
>> It's more that non-KVM userspace doesn't care about what these ioctls are
>> doing... The enabling of 'interp, aif, ioat' is only pertinent when there
>> is a KVM userspace, specifically because the information being shared /
>> actions being performed as a result are only relevant to properly enabling
>> zPCI features when the zPCI device is being passed through to a VM
>> guest.
>
> Then why are they not KVM ioctls?

Well, the primary reason I ended up here was that I need to ensure that
the operation is only performed when guest X owns host zPCI device Y.
The vfio-pci device ioctl had the benefit of acting on device
granularity + also already being aware of the host PCI (and thus zPCI)
device association -- so I already know exactly what hostdev is being
referenced for the operation. All that was needed was the KVM notifier
to ensure the vfio device was associated to a KVM guest.

I think moving over to a KVM ioctl is doable; I might still need to rely
on VFIO_GROUP_NOTIFY_SET_KVM though, not sure yet.

Based on prior comments in this thread I'm assuming Alex shares that
view too (don't use vfio device ioctl for something only being used for
VM passthrough) so I'll start looking at using KVM ioctls instead.


2022-02-09 12:29:48

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, Feb 08, 2022 at 03:33:58PM -0500, Matthew Rosato wrote:

> > Is the purpose of IOAT to associate the device to a set of KVM page
> > tables? That seems like a container or future iommufd operation. I
>
> Yes, here we are establishing a relationship with the DMA table in the guest
> so that once mappings are established guest PCI operations (handled via
> special instructions in s390) don't need to go through the host but can be
> directly handled by firmware (so, effectively guest can keep running on its
> vcpu vs breaking out).

Oh, well, certainly sounds like a NAK on that - anything to do with
the DMA translation of a PCI device must go through the iommu layer,
not here.

Let's not repeat the iommu subsystem bypass mess POWER made, please.

> It's more that non-KVM userspace doesn't care about what these ioctls are
> doing... The enabling of 'interp, aif, ioat' is only pertinent when there
> is a KVM userspace, specifically because the information being shared /
> actions being performed as a result are only relevant to properly enabling
> zPCI features when the zPCI device is being passed through to a VM
> guest.

Then why are they not KVM ioctls?

Jason

2022-02-10 14:11:16

by Niklas Schnelle

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, 2022-02-08 at 16:40 -0400, Jason Gunthorpe wrote:
> On Tue, Feb 08, 2022 at 03:33:58PM -0500, Matthew Rosato wrote:
>
> > > Is the purpose of IOAT to associate the device to a set of KVM page
> > > tables? That seems like a container or future iommufd operation. I
> >
> > Yes, here we are establishing a relationship with the DMA table in the guest
> > so that once mappings are established guest PCI operations (handled via
> > special instructions in s390) don't need to go through the host but can be
> > directly handled by firmware (so, effectively guest can keep running on its
> > vcpu vs breaking out).
>
> Oh, well, certainly sounds like a NAK on that - anything to do with
> the DMA translation of a PCI device must go through the iommu layer,
> not here.
>
> Lets not repeat the iommu subsytem bypass mess power made please.

Maybe some context on all of this. First, it's important to note that on
s390x the PCI IOMMU hardware is controlled with special instructions.
For pass-through this is actually quite nice, as it makes it relatively
simple for us to always run with an IOMMU in the guest; we simply need
to provide the instructions. This means we get full IOMMU protection for
pass-through devices on KVM guests, guests with pass-through remain
pageable, and we can even support nested pass-through.

This is possible with relatively little overhead because we can handle
all of the guest's per-map/unmap IOMMU operations with a single
instruction intercept. The instruction we need to intercept is called
Refresh PCI Translations (RPCIT). Its job is twofold.

For an OS running directly on our machine hypervisor (LPAR), it flushes
the IOMMU's TLB by informing it which pages have been invalidated, while
the hardware walks the page tables and fills the TLB on its own to
establish mappings for previously invalid IOVAs.

In a KVM or z/VM guest, the guest is informed that IOMMU translations
need to be refreshed even for previously invalid IOVAs. With this the
guest builds its IOMMU translation tables as normal, but then does an
RPCIT for the IOVA range it touched. In the hypervisor we can then
simply walk the translation tables, pin the guest pages and map them in
the host IOMMU. Prior to this series this happened in QEMU, which does
the map via vfio-iommu-type1 from user space. This works and will
remain as a fallback. Sadly it is quite slow and has a large impact on
performance, as we need to do a lot of mapping operations while the DMA
API of the guest goes through the virtual IOMMU. This series thus adds
the same functionality as a KVM intercept of RPCIT. I think this
neatly fits into KVM: we're emulating an instruction after all, and most
of its work is KVM-specific pinning of guest pages. Importantly, all
other handling, like IOMMU domain attachment, still goes through
vfio-iommu-type1 and we just fast-path the map/unmap operations.

In the code the map/unmap boils down to dma_walk_cpu_trans() and parts
of dma_shadow_cpu_trans(), both called from dma_table_shadow(). The
former is a function already shared between our DMA API and IOMMU API
implementations and is the only code that walks the host translation
tables. So, in a way, we are side-stepping the IOMMU API ops, that is
true, but we do not side-step the IOMMU host table access code paths.
Notice how our IOMMU API implementation is also < 400 LOC because the
DMA and IOMMU APIs share code.

That said, I believe we should still be able to do the mapping in a KVM
RPCIT intercept but go through the IOMMU API ops if this side-stepping
is truly unacceptable. It definitely adds overhead, though, and I'm not
sure what we gain in clarity or maintainability, since we already share
the actual host table access code and there is only one PCI IOMMU and
it is part of the architecture. Also, either KVM or QEMU needs to know
about the same details for looking at guest IOMMU translation tables /
emulating the guest IOMMU. It's also clear that the IOMMU API will
remain functional on its own, as it is necessary for any non-KVM use
case, which of course can't intercept RPCIT but on the other hand can
also keep mappings much longer, significantly reducing overhead.
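
As a rough illustration of the shadowing flow Niklas describes (not the
code from this series; the helper names read_guest_dma_entry(),
entry_is_valid(), entry_flags(), pin_guest_page(), map_host_iova() and
unmap_host_iova() are hypothetical stand-ins):

/* Sketch of the RPCIT intercept fast path: for each IOVA page in the
 * refreshed range, read the guest's DMA translation entry, pin the
 * backing guest page and mirror the mapping into the host IOMMU table;
 * invalid entries are treated as unmap/invalidation requests.
 */
static int shadow_rpcit_range(struct kvm *kvm, struct zpci_dev *zdev,
			      u64 iova, u64 len)
{
	u64 addr;

	for (addr = iova; addr < iova + len; addr += PAGE_SIZE) {
		u64 gentry, hpa;

		if (read_guest_dma_entry(kvm, zdev, addr, &gentry))
			return -EIO;
		if (!entry_is_valid(gentry)) {
			unmap_host_iova(zdev, addr);	/* TLB invalidation */
			continue;
		}
		if (pin_guest_page(kvm, gentry, &hpa))
			return -ENOMEM;		/* guest is told to retry */
		if (map_host_iova(zdev, addr, hpa, entry_flags(gentry)))
			return -EIO;
	}
	return 0;
}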


2022-02-10 15:58:36

by Niklas Schnelle

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Thu, 2022-02-10 at 09:01 -0400, Jason Gunthorpe wrote:
> On Thu, Feb 10, 2022 at 12:15:58PM +0100, Niklas Schnelle wrote:
>
> > In a KVM or z/VM guest the guest is informed that IOMMU translations
> > need to be refreshed even for previously invalid IOVAs. With this the
> > guest builds it's IOMMU translation tables as normal but then does a
> > RPCIT for the IOVA range it touched. In the hypervisor we can then
> > simply walk the translation tables, pin the guest pages and map them in
> > the host IOMMU. Prior to this series this happened in QEMU which does
> > the map via vfio-iommu-type1 from user-space. This works and will
> > remain as a fallback. Sadly it is quite slow and has a large impact on
> > performance as we need to do a lot of mapping operations as the DMA API
> > of the guest goes through the virtual IOMMU. This series thus adds the
> > same functionality but as a KVM intercept of RPCIT. Now I think this
> > neatly fits into KVM, we're emulating an instruction after all and most
> > of its work is KVM specific pinning of guest pages. Importantly all
> > other handling like IOMMU domain attachment still goes through vfio-
> > iommu-type1 and we just fast path the map/unmap operations.
>
> So you create an iommu_domain and then hand it over to kvm which then
> does map/unmap operations on it under the covers?

Yes

>
> How does the page pinning work?

The pinning is done directly in the RPCIT interception handler, which
pins both the guest IOMMU tables and the guest pages mapped for DMA.

>
> In the design we are trying to reach I would say this needs to be
> modeled as a special iommu_domain that has this automatic map/unmap
> behavior from following user pages. Creating it would specify the kvm
> and the in-guest base address of the guest's page table.

Makes sense.

> Then the
> magic kernel code you describe can operate on its own domain without
> becoming confused with a normal map/unmap domain.

This sounds like an interesting idea. Looking at
drivers/iommu/s390_iommu.c, most of that is pretty trivial domain
handling. I wonder if we could share this by marking the existing
s390_iommu_domain type with kind of a "lent out to KVM" flag, possibly
by simply having a non-NULL pointer to a struct holding the guest base
address, the kvm, etc.? That way we can share the setup/teardown of the
domain and of the host IOMMU tables, as well as the aperture checks,
while also being able to keep the IOMMU API from interfering with the
KVM RPCIT intercept and vice versa. I.e. while the domain is under
control of KVM's RPCIT handling we make all IOMMU map/unmap fail.

To me this more direct involvement of IOMMU and KVM on s390x is also a
direct consequence of it using special instructions. Naturally those
instructions can be intercepted or run under hardware accelerated
virtualization.

>
> It is like the HW nested translation other CPUs are doing, but instead
> of HW nested, it is SW nested.

Yes, very good analogy. Has any of that nested IOMMU translation work
been merged yet? I know AMD had something in the works, and nested
translations have been around for the MMU for a while and are also used
on s390x. We're definitely thinking about HW-nested IOMMU translations
too, so any design we come up with should be able to deal with that as
well. Basically, we would then execute RPCIT without leaving the
hardware virtualization mode (SIE). We believe that would require
pinning all of guest memory, though, because HW can't really pin pages.
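
The KVM-owned domain variant sketched in this sub-thread might look
roughly like the following; this is purely illustrative
(s390_kvm_iommu_domain and its fields are hypothetical, following the
suggestion of a domain created with the kvm and the guest table origin):

/* Sketch: a domain variant that carries the KVM linkage and the guest
 * DMA-table origin so the RPCIT intercept can drive map/unmap, while
 * ordinary iommu_map()/iommu_unmap() callers are kept out.
 */
struct s390_kvm_iommu_domain {
	struct iommu_domain	domain;		/* embeds the generic domain */
	struct kvm		*kvm;		/* owning guest */
	unsigned long		guest_iota;	/* guest DMA table origin/format */
	struct mutex		lock;		/* serializes RPCIT shadowing */
};

static inline struct s390_kvm_iommu_domain *
to_kvm_domain(struct iommu_domain *dom)
{
	return container_of(dom, struct s390_kvm_iommu_domain, domain);
}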


2022-02-10 21:42:17

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On 2/10/22 10:23 AM, Jason Gunthorpe wrote:
> On Thu, Feb 10, 2022 at 03:06:35PM +0100, Niklas Schnelle wrote:
>
>>> How does the page pinning work?
>>
>> The pinning is done directly in the RPCIT interception handler pinning
>> both the IOMMU tables and the guest pages mapped for DMA.
>
> And if pinning fails?

The RPCIT instruction goes back to the guest with an indication that
informs it the operation failed / gives it impetus to kick off a guest
DMA refresh and free up space (unpin).

>
>>> Then the
>>> magic kernel code you describe can operate on its own domain without
>>> becoming confused with a normal map/unmap domain.
>>
>> This sounds like an interesting idea. Looking at
>> drivers/iommu/s390_iommu.c most of that is pretty trivial domain
>> handling. I wonder if we could share this by marking the existing
>> s390_iommu_domain type with kind of a "lent out to KVM" flag.
>
> Lu has posted a series here:
>
> https://lore.kernel.org/linux-iommu/[email protected]
>
> Which allows the iommu driver to create a domain with unique ops, so
> you'd just fork the entire thing, have your own struct
> s390_kvm_iommu_domain and related ops.
>

OK, looking into this, thanks for the pointer... Sounds to me like we
then want to make the determination upfront and then ensure the right
iommu domain ops are registered for the device sometime before creation,
based upon the usecase -- general userspace: s390_iommu_ops (existing),
kvm: s390_kvm_iommu_domain (new).

> When the special creation flow is triggered you'd just create one of
> these with the proper ops already setup.
>
> We are imagining a special ioctl to create these things and each IOMMU
> HW driver can supply a unique implementation suited to their HW
> design.

But I haven't connected the dots on this part -- At the end of the day
for this 'special creation flow' I need the kvm + starting point of the
guest table + format before we let the new s390_kvm_iommu_domain start
doing automatic map/unmap during RPCIT intercept -- This initial setup
has to come from a special ioctl as you say, but where do you see it
living? I could certainly roll my own via a KVM ioctl or whatever, but
it sounds like you're also referring to a general-purpose ioctl to
encompass each of the different unique implementations, with this s390
kvm approach being one.


2022-02-10 23:29:48

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Thu, Feb 10, 2022 at 12:15:58PM +0100, Niklas Schnelle wrote:

> In a KVM or z/VM guest the guest is informed that IOMMU translations
> need to be refreshed even for previously invalid IOVAs. With this the
> guest builds its IOMMU translation tables as normal but then does a
> RPCIT for the IOVA range it touched. In the hypervisor we can then
> simply walk the translation tables, pin the guest pages and map them in
> the host IOMMU. Prior to this series this happened in QEMU which does
> the map via vfio-iommu-type1 from user-space. This works and will
> remain as a fallback. Sadly it is quite slow and has a large impact on
> performance as we need to do a lot of mapping operations as the DMA API
> of the guest goes through the virtual IOMMU. This series thus adds the
> same functionality but as a KVM intercept of RPCIT. Now I think this
> neatly fits into KVM: we're emulating an instruction after all, and most
> of its work is KVM-specific pinning of guest pages. Importantly all
> other handling like IOMMU domain attachment still goes through vfio-
> iommu-type1 and we just fast path the map/unmap operations.

So you create an iommu_domain and then hand it over to kvm which then
does map/unmap operations on it under the covers?

How does the page pinning work?

In the design we are trying to reach I would say this needs to be
modeled as a special iommu_domain that has this automatic map/unmap
behavior from following user pages. Creating it would specify the kvm
and the in-guest base address of the guest's page table. Then the
magic kernel code you describe can operate on its own domain without
becoming confused with a normal map/unmap domain.

It is like the HW nested translation other CPUs are doing, but instead
of HW nested, it is SW nested.

Jason

2022-02-11 00:08:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Thu, Feb 10, 2022 at 01:59:35PM -0500, Matthew Rosato wrote:

> OK, looking into this, thanks for the pointer... Sounds to me like we then
> want to make the determination upfront and then ensure the right iommu
> domain ops are registered for the device sometime before creation, based
> upon the usecase -- general userspace: s390_iommu_ops (existing), kvm:
> s390_kvm_iommu_domain (new).

Yes, that is the idea. I expect there will be many types of these
special iommu domains, e.g. Intel has talked about directly using the
KVM CPU page table as the IOMMU page table.

> > When the special creation flow is triggered you'd just create one of
> > these with the proper ops already setup.
> >
> > We are imagining a special ioctl to create these things and each IOMMU
> > HW driver can supply a unique implementation suited to their HW
> > design.
>
> But I haven't connected the dots on this part -- At the end of the day for
> this 'special creation flow' I need the kvm + starting point of the guest
> table + format before we let the new s390_kvm_iommu_domain start doing
> automatic map/unmap during RPCIT intercept -- This initial setup has to come
> from a special ioctl as you say, but where do you see it living? I could
> certainly roll my own via a KVM ioctl or whatever, but it sounds like you're
> also referring to a general-purpose ioctl to encompass each of the different
> unique implementations, with this s390 kvm approach being one.

So, the ioctl will need, as input, a kvm FD and an iommufd FD, plus
additional IOMMU-driver-specific data (format, starting point, etc).

The kvm supplies the context for the RPCIT to be captured in.

The result is the creation of an iommu_domain inside iommufd, done by
some iommu_ops->alloc_domain_xxxx() driver callback.

Which FD gets the ioctl is a bit of an aesthetic choice, but I
predict that iommufd makes the most sense since an object is being
created inside iommufd.
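
A minimal sketch of what such a creation ioctl payload might carry,
purely as an assumption about an eventual uapi (none of these names
exist today):

struct iommu_kvm_domain_alloc {
	__u32 size;		/* sizeof(struct iommu_kvm_domain_alloc) */
	__u32 flags;
	__s32 kvm_fd;		/* KVM context the RPCIT intercepts run under */
	__u32 __reserved;
	__aligned_u64 guest_table;	/* in-guest base address of the DMA table */
	__u32 format;			/* IOMMU-driver-specific table format */
	__u32 out_domain_id;		/* returned iommufd object handle */
};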

This flow is very similar to the 'userspace page table' flow others
are looking at, but has the extra twist that a KVM FD is needed to supply
the CPU page table.

It may overlap nicely with the Intel direction I mentioned. It is just
ugly layering-wise that KVM is getting shoved into platform code and
uapis all over the place, but I suppose that is unavoidable. And the
required loose coupling with the kvm module means all kinds of
symbol_get()s etc. :(

Jason



2022-02-11 04:57:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Thu, Feb 10, 2022 at 03:06:35PM +0100, Niklas Schnelle wrote:

> > How does the page pinning work?
>
> The pinning is done directly in the RPCIT interception handler pinning
> both the IOMMU tables and the guest pages mapped for DMA.

And if pinning fails?

> > Then the
> > magic kernel code you describe can operate on its own domain without
> > becoming confused with a normal map/unmap domain.
>
> This sounds like an interesting idea. Looking at
> drivers/iommu/s390_iommu.c most of that is pretty trivial domain
> handling. I wonder if we could share this by marking the existing
> s390_iommu_domain type with kind of a "lent out to KVM" flag.

Lu has posted a series here:

https://lore.kernel.org/linux-iommu/[email protected]

Which allows the iommu driver to create a domain with unique ops, so
you'd just fork the entire thing, have your own struct
s390_kvm_iommu_domain and related ops.

When the special creation flow is triggered you'd just create one of
these with the proper ops already setup.

We are imagining a special ioctl to create these things and each IOMMU
HW driver can supply a unique implementation suited to their HW
design.
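
A rough sketch of what such a forked, KVM-only domain type could look
like (hypothetical names; not code from Lu's series or this one):

struct s390_kvm_iommu_domain {
	struct iommu_domain	domain;		/* generic IOMMU core view */
	struct kvm		*kvm;		/* guest driving the mappings */
	unsigned long		guest_table;	/* in-guest DMA table origin */
	u8			format;		/* guest table format */
};

/*
 * Its ops would cover attach/detach and free as usual but omit map/unmap
 * entirely: all updates flow through the KVM RPCIT intercept, so the
 * generic IOMMU API can never confuse or race with the KVM-managed domain.
 */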

> KVM RPCIT intercept and vice versa. I.e. while the domain is under
> control of KVM's RPCIT handling we make all IOMMU map/unmap fail.

It is not "under the control of" the domain would be created as linked
to kvm and would never, ever, be anything else.

> To me this more direct involvement of IOMMU and KVM on s390x is also a
> direct consequence of it using special instructions. Naturally those
> instructions can be intercepted or run under hardware accelerated
> virtualization.

Well, no, you've just created a kernel-side SW emulated nested
translation scheme. Other CPUs have talked about doing this too, but
nobody has attempted it.

You can make the same argument for any CPU's scheme: a trapped mmio
store is not fundamentally any different from a special instruction
that traps, other than how the information is transferred.

> Yes very good analogy. Has any of that nested IOMMU translations work
> been merged yet?

No. We are making quiet progress, slowly though. I'll add your
interest to my list.

> too. Basically we would then execute RPCIT without leaving the
> hardware virtualization mode (SIE). We believe that that would
> require pinning all of guest memory though because HW can't really
> pin pages.

Right, this is what other iommu HW will have to do.

Jason

2022-02-14 15:00:17

by Pierre Morel

[permalink] [raw]
Subject: Re: [PATCH v3 19/30] KVM: s390: pci: provide routines for enabling/disabling interpretation



On 2/4/22 22:15, Matthew Rosato wrote:
> These routines will be wired into the vfio_pci_zdev ioctl handlers to
> respond to requests to enable / disable a device for zPCI Load/Store
> interpretation.
>
> The first time such a request is received, enable the necessary facilities
> for the guest.
>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> arch/s390/include/asm/kvm_pci.h | 4 ++
> arch/s390/kvm/pci.c | 102 ++++++++++++++++++++++++++++++++
> arch/s390/pci/pci.c | 3 +
> 3 files changed, 109 insertions(+)
>
> diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
> index ef10f9e46e37..422701d526dd 100644
> --- a/arch/s390/include/asm/kvm_pci.h
> +++ b/arch/s390/include/asm/kvm_pci.h
> @@ -24,4 +24,8 @@ struct kvm_zdev {
> int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
> void kvm_s390_pci_dev_release(struct zpci_dev *zdev);
>
> +int kvm_s390_pci_interp_probe(struct zpci_dev *zdev);
> +int kvm_s390_pci_interp_enable(struct zpci_dev *zdev);
> +int kvm_s390_pci_interp_disable(struct zpci_dev *zdev);
> +
> #endif /* ASM_KVM_PCI_H */
> diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
> index 9b8390133e15..16bef3935284 100644
> --- a/arch/s390/kvm/pci.c
> +++ b/arch/s390/kvm/pci.c
> @@ -12,7 +12,9 @@
> #include <asm/kvm_pci.h>
> #include <asm/pci.h>
> #include <asm/pci_insn.h>
> +#include <asm/sclp.h>
> #include "pci.h"
> +#include "kvm-s390.h"
>
> struct zpci_aift *aift;
>
> @@ -153,6 +155,106 @@ int kvm_s390_pci_aen_init(u8 nisc)
> return rc;
> }
>
> +int kvm_s390_pci_interp_probe(struct zpci_dev *zdev)
> +{
> + /* Must have appropriate hardware facilities */
> + if (!sclp.has_zpci_lsi || !sclp.has_aisii || !sclp.has_aeni ||
> + !sclp.has_aisi || !test_facility(69) || !test_facility(70) ||
> + !test_facility(71) || !test_facility(72)) {
> + return -EINVAL;
> + }

I do not think we need to check the STFLE facilities: when the SCLP bit
indicating interpretation of a facility is installed, the STFLE bit
indicating the interpreted facility is also installed.


Otherwise, looks good to me, with this change:
Reviewed-by: Pierre Morel <[email protected]>
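
With that change, the probe check would reduce to something like the
following (a sketch of the suggested simplification, not the posted
patch):

	/* Must have appropriate hardware facilities */
	if (!sclp.has_zpci_lsi || !sclp.has_aisii || !sclp.has_aeni ||
	    !sclp.has_aisi)
		return -EINVAL;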

> +
> + /* Must have a KVM association registered */
> + if (!zdev->kzdev || !zdev->kzdev->kvm)
> + return -EINVAL;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_probe);
> +
> +int kvm_s390_pci_interp_enable(struct zpci_dev *zdev)
> +{
> + u32 gisa;
> + int rc;
> +
> + if (!zdev->kzdev || !zdev->kzdev->kvm)
> + return -EINVAL;
> +
> + /*
> + * If this is the first request to use an interpreted device, make the
> + * necessary vcpu changes
> + */
> + if (!zdev->kzdev->kvm->arch.use_zpci_interp)
> + kvm_s390_vcpu_pci_enable_interp(zdev->kzdev->kvm);
> +
> + /*
> + * In the event of a system reset in userspace, the GISA designation
> + * may still be assigned because the device is still enabled.
> + * Verify it's the same guest before proceeding.
> + */
> + gisa = (u32)virt_to_phys(&zdev->kzdev->kvm->arch.sie_page2->gisa);
> + if (zdev->gisa != 0 && zdev->gisa != gisa)
> + return -EPERM;
> +
> + if (zdev_enabled(zdev)) {
> + zdev->gisa = 0;
> + rc = zpci_disable_device(zdev);
> + if (rc)
> + return rc;
> + }
> +
> + /*
> + * Store information about the identity of the kvm guest allowed to
> + * access this device via interpretation to be used by host CLP
> + */
> + zdev->gisa = gisa;
> +
> + rc = zpci_enable_device(zdev);
> + if (rc)
> + goto err;
> +
> + /* Re-register the IOMMU that was already created */
> + rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> + virt_to_phys(zdev->dma_table));
> + if (rc)
> + goto err;
> +
> + return rc;
> +
> +err:
> + zdev->gisa = 0;
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_enable);
> +
> +int kvm_s390_pci_interp_disable(struct zpci_dev *zdev)
> +{
> + int rc;
> +
> + if (zdev->gisa == 0)
> + return -EINVAL;
> +
> + /* Remove the host CLP guest designation */
> + zdev->gisa = 0;
> +
> + if (zdev_enabled(zdev)) {
> + rc = zpci_disable_device(zdev);
> + if (rc)
> + return rc;
> + }
> +
> + rc = zpci_enable_device(zdev);
> + if (rc)
> + return rc;
> +
> + /* Re-register the IOMMU that was already created */
> + rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> + virt_to_phys(zdev->dma_table));
> +
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_disable);
> +
> int kvm_s390_pci_dev_open(struct zpci_dev *zdev)
> {
> struct kvm_zdev *kzdev;
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 13033717cd4e..5dbe49ec325e 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -147,6 +147,7 @@ int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
> zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
> return cc;
> }
> +EXPORT_SYMBOL_GPL(zpci_register_ioat);
>
> /* Modify PCI: Unregister I/O address translation parameters */
> int zpci_unregister_ioat(struct zpci_dev *zdev, u8 dmaas)
> @@ -727,6 +728,7 @@ int zpci_enable_device(struct zpci_dev *zdev)
> zpci_update_fh(zdev, fh);
> return rc;
> }
> +EXPORT_SYMBOL_GPL(zpci_enable_device);
>
> int zpci_disable_device(struct zpci_dev *zdev)
> {
> @@ -750,6 +752,7 @@ int zpci_disable_device(struct zpci_dev *zdev)
> }
> return rc;
> }
> +EXPORT_SYMBOL_GPL(zpci_disable_device);
>
> /**
> * zpci_hot_reset_device - perform a reset of the given zPCI function
>

--
Pierre Morel
IBM Lab Boeblingen