2022-02-07 06:37:11

by Matthew Rosato

Subject: [PATCH v3 00/30] KVM: s390: enable zPCI for interpretive execution

Enable interpretive execution of zPCI instructions + adapter interruption
forwarding for s390x KVM vfio-pci. This is done by introducing a series
of new vfio-pci feature ioctls that are unique to vfio-pci-zdev (s390x)
and are used to negotiate the various aspects of zPCI interpretation
setup. By allowing interpretation of zPCI instructions and firmware
delivery of interrupts to guests, we can significantly reduce the
frequency of guest SIE exits for zPCI. We then see additional gains by
handling a hot-path instruction that can still intercept to the
hypervisor (RPCIT) directly in KVM.

From the perspective of guest configuration, you pass through zPCI
devices in the same manner as before, with interpretation support used
by default when it is available in both the kernel and QEMU.

Will reply with a link to the associated QEMU series.

Changelog v2->v3:
- More R-bs / ACKs (Thanks!)
- Re-word patch 6 commit message (Claudio)
- Patch 8 + some later patches: s/gd/gisa/ (Pierre)
- Patch 12: remove !mdd check (Pierre)
- some more virt/phys conversions (Pierre)
- Patch 18: check more sclp bits & facilities during interp probe (Pierre)
- Patch 21: fix fabricated status for some RPCIT intercept errors
- Patch 25-27: remove get/set checks from feature ioctl handlers as they
are already done in vfio core (Pierre)
- Patch 26: s/aif/aif_float/ and s/fhost/aif_fhost/
- remove kvm_s390_pci_attach_kvm and just do the work inline (Pierre)
- Use CONFIG_VFIO_PCI_ZDEV instead of CONFIG_PCI in Makefile and other
code locations (Pierre)
- Due to the above, re-arrange series order so CONFIG_VFIO_PCI_ZDEV is
introduced earlier
- s/aift->lock/aift->aift_lock/ (Pierre)
- Break some AEN init code into local functions zpci_setup_aipb() and
zpci_reset_aipb() (Pierre)
- check for errors on kvm_s390_gisc_register (Pierre)
- handle airq clear errors differently when we know the device is being
removed vs any other reason aif is being disabled (Pierre)
- s/ioat->lock/ioat->ioat_lock/ (Pierre)
- Fix backout case in kvm_s390_pci_ioat_enable, re-arrange rc settings
slightly (Pierre)
- Add a CONFIG_VFIO_PCI_ZDEV check when determining if it's safe to allow
KVM_S390_VM_CPU_FEAT_ZPCI_INTERP (need both the facilities and the
kvm/pci.o pieces to allow interpretation)

Matthew Rosato (30):
s390/sclp: detect the zPCI load/store interpretation facility
s390/sclp: detect the AISII facility
s390/sclp: detect the AENI facility
s390/sclp: detect the AISI facility
s390/airq: pass more TPI info to airq handlers
s390/airq: allow for airq structure that uses an input vector
s390/pci: externalize the SIC operation controls and routine
s390/pci: stash associated GISA designation
s390/pci: export some routines related to RPCIT processing
s390/pci: stash dtsm and maxstbl
s390/pci: add helper function to find device by handle
s390/pci: get SHM information from list pci
s390/pci: return status from zpci_refresh_trans
vfio/pci: re-introduce CONFIG_VFIO_PCI_ZDEV
KVM: s390: pci: add basic kvm_zdev structure
KVM: s390: pci: do initial setup for AEN interpretation
KVM: s390: pci: enable host forwarding of Adapter Event Notifications
KVM: s390: mechanism to enable guest zPCI Interpretation
KVM: s390: pci: provide routines for enabling/disabling interpretation
KVM: s390: pci: provide routines for enabling/disabling interrupt
forwarding
KVM: s390: pci: provide routines for enabling/disabling IOAT assist
KVM: s390: pci: handle refresh of PCI translations
KVM: s390: intercept the rpcit instruction
vfio-pci/zdev: wire up group notifier
vfio-pci/zdev: wire up zPCI interpretive execution support
vfio-pci/zdev: wire up zPCI adapter interrupt forwarding support
vfio-pci/zdev: wire up zPCI IOAT assist support
vfio-pci/zdev: add DTSM to clp group capability
KVM: s390: introduce CPU feature for zPCI Interpretation
MAINTAINERS: additional files related kvm s390 pci passthrough

MAINTAINERS | 2 +
arch/s390/include/asm/airq.h | 7 +-
arch/s390/include/asm/kvm_host.h | 5 +
arch/s390/include/asm/kvm_pci.h | 60 +++
arch/s390/include/asm/pci.h | 12 +
arch/s390/include/asm/pci_clp.h | 11 +-
arch/s390/include/asm/pci_dma.h | 3 +
arch/s390/include/asm/pci_insn.h | 31 +-
arch/s390/include/asm/sclp.h | 4 +
arch/s390/include/asm/tpi.h | 13 +
arch/s390/include/uapi/asm/kvm.h | 1 +
arch/s390/kvm/Makefile | 1 +
arch/s390/kvm/interrupt.c | 95 +++-
arch/s390/kvm/kvm-s390.c | 57 ++-
arch/s390/kvm/kvm-s390.h | 10 +
arch/s390/kvm/pci.c | 850 +++++++++++++++++++++++++++++++
arch/s390/kvm/pci.h | 60 +++
arch/s390/kvm/priv.c | 49 ++
arch/s390/pci/pci.c | 31 ++
arch/s390/pci/pci_clp.c | 28 +-
arch/s390/pci/pci_dma.c | 7 +-
arch/s390/pci/pci_insn.c | 15 +-
arch/s390/pci/pci_irq.c | 48 +-
drivers/iommu/s390-iommu.c | 4 +-
drivers/s390/char/sclp_early.c | 4 +
drivers/s390/cio/airq.c | 12 +-
drivers/s390/cio/qdio_thinint.c | 6 +-
drivers/s390/crypto/ap_bus.c | 9 +-
drivers/s390/virtio/virtio_ccw.c | 6 +-
drivers/vfio/pci/Kconfig | 11 +
drivers/vfio/pci/Makefile | 2 +-
drivers/vfio/pci/vfio_pci_core.c | 8 +
drivers/vfio/pci/vfio_pci_zdev.c | 275 +++++++++-
include/linux/vfio_pci_core.h | 42 +-
include/uapi/linux/vfio.h | 22 +
include/uapi/linux/vfio_zdev.h | 51 ++
36 files changed, 1787 insertions(+), 65 deletions(-)
create mode 100644 arch/s390/include/asm/kvm_pci.h
create mode 100644 arch/s390/kvm/pci.c
create mode 100644 arch/s390/kvm/pci.h

--
2.27.0



2022-02-07 07:49:57

by Matthew Rosato

Subject: [PATCH v3 13/30] s390/pci: return status from zpci_refresh_trans

Current callers of zpci_refresh_trans don't need to interrogate the status
returned from the underlying instructions. However, a subsequent patch
will add a KVM caller that needs this information. Add a new argument to
zpci_refresh_trans to pass the address of a status byte and update
existing call sites to provide it.
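
To illustrate the kind of caller this enables, here is a sketch (the
helper name is made up and not part of this series) of a KVM-side user
that surfaces the architected status byte instead of collapsing it into
an errno:

/* Sketch only; not part of this patch. */
static int refresh_and_fetch_status(struct zpci_dev *zdev, u64 dma_addr,
                                    u64 size, u8 *status)
{
        int rc;

        *status = 0;
        rc = zpci_refresh_trans((u64)zdev->fh << 32, dma_addr,
                                PAGE_ALIGN(size), status);

        /* On failure, *status carries the RPCIT status for the guest. */
        return rc;
}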

Reviewed-by: Pierre Morel <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Niklas Schnelle <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/pci_insn.h | 2 +-
arch/s390/pci/pci_dma.c | 6 ++++--
arch/s390/pci/pci_insn.c | 10 +++++-----
drivers/iommu/s390-iommu.c | 4 +++-
4 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/pci_insn.h b/arch/s390/include/asm/pci_insn.h
index 5331082fa516..32759c407b8f 100644
--- a/arch/s390/include/asm/pci_insn.h
+++ b/arch/s390/include/asm/pci_insn.h
@@ -135,7 +135,7 @@ union zpci_sic_iib {
DECLARE_STATIC_KEY_FALSE(have_mio);

u8 zpci_mod_fc(u64 req, struct zpci_fib *fib, u8 *status);
-int zpci_refresh_trans(u64 fn, u64 addr, u64 range);
+int zpci_refresh_trans(u64 fn, u64 addr, u64 range, u8 *status);
int __zpci_load(u64 *data, u64 req, u64 offset);
int zpci_load(u64 *data, const volatile void __iomem *addr, unsigned long len);
int __zpci_store(u64 data, u64 req, u64 offset);
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index a81de48d5ea7..b0a2380bcad8 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -23,8 +23,9 @@ static u32 s390_iommu_aperture_factor = 1;

static int zpci_refresh_global(struct zpci_dev *zdev)
{
+ u8 status;
return zpci_refresh_trans((u64) zdev->fh << 32, zdev->start_dma,
- zdev->iommu_pages * PAGE_SIZE);
+ zdev->iommu_pages * PAGE_SIZE, &status);
}

unsigned long *dma_alloc_cpu_table(void)
@@ -183,6 +184,7 @@ static int __dma_purge_tlb(struct zpci_dev *zdev, dma_addr_t dma_addr,
size_t size, int flags)
{
unsigned long irqflags;
+ u8 status;
int ret;

/*
@@ -201,7 +203,7 @@ static int __dma_purge_tlb(struct zpci_dev *zdev, dma_addr_t dma_addr,
}

ret = zpci_refresh_trans((u64) zdev->fh << 32, dma_addr,
- PAGE_ALIGN(size));
+ PAGE_ALIGN(size), &status);
if (ret == -ENOMEM && !s390_iommu_strict) {
/* enable the hypervisor to free some resources */
if (zpci_refresh_global(zdev))
diff --git a/arch/s390/pci/pci_insn.c b/arch/s390/pci/pci_insn.c
index 0509554301c7..ca6399d52767 100644
--- a/arch/s390/pci/pci_insn.c
+++ b/arch/s390/pci/pci_insn.c
@@ -77,20 +77,20 @@ static inline u8 __rpcit(u64 fn, u64 addr, u64 range, u8 *status)
return cc;
}

-int zpci_refresh_trans(u64 fn, u64 addr, u64 range)
+int zpci_refresh_trans(u64 fn, u64 addr, u64 range, u8 *status)
{
- u8 cc, status;
+ u8 cc;

do {
- cc = __rpcit(fn, addr, range, &status);
+ cc = __rpcit(fn, addr, range, status);
if (cc == 2)
udelay(ZPCI_INSN_BUSY_DELAY);
} while (cc == 2);

if (cc)
- zpci_err_insn(cc, status, addr, range);
+ zpci_err_insn(cc, *status, addr, range);

- if (cc == 1 && (status == 4 || status == 16))
+ if (cc == 1 && (*status == 4 || *status == 16))
return -ENOMEM;

return (cc) ? -EIO : 0;
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 50860ebdd087..845bb99c183e 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -214,6 +214,7 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
unsigned long irq_flags, nr_pages, i;
unsigned long *entry;
int rc = 0;
+ u8 status;

if (dma_addr < s390_domain->domain.geometry.aperture_start ||
dma_addr + size > s390_domain->domain.geometry.aperture_end)
@@ -238,7 +239,8 @@ static int s390_iommu_update_trans(struct s390_domain *s390_domain,
spin_lock(&s390_domain->list_lock);
list_for_each_entry(domain_device, &s390_domain->devices, list) {
rc = zpci_refresh_trans((u64) domain_device->zdev->fh << 32,
- start_dma_addr, nr_pages * PAGE_SIZE);
+ start_dma_addr, nr_pages * PAGE_SIZE,
+ &status);
if (rc)
break;
}
--
2.27.0


2022-02-07 10:52:17

by Matthew Rosato

Subject: [PATCH v3 06/30] s390/airq: allow for airq structure that uses an input vector

When doing device passthrough where interrupts are being forwarded from
host to guest, we wish to use a pinned section of guest memory as the
vector (the same memory used by the guest as the vector). To accomplish
this, add a new parameter for airq_iv_create which allows passing an
existing vector to be used instead of allocating a new one. The caller
is responsible for ensuring the vector is pinned in memory as well as for
unpinning the memory when the vector is no longer needed.

A subsequent patch will use this new parameter for zPCI interpretation.
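
For illustration only, a caller that supplies its own pinned vector could
wrap the new argument like this (the helper below is a sketch and not
part of this patch):

/* Sketch only; not part of this patch. */
static struct airq_iv *create_iv_from_guest(unsigned long bits,
                                            unsigned long *pinned_guest_vec)
{
        /*
         * pinned_guest_vec points into guest memory that the caller has
         * already pinned; airq_iv_release() will not free it, so the
         * caller must unpin it once the iv is gone.
         */
        return airq_iv_create(bits, AIRQ_IV_GUESTVEC, pinned_guest_vec);
}

Existing callers simply pass NULL for the new argument, as the updated
call sites below show.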

Reviewed-by: Pierre Morel <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/airq.h | 4 +++-
arch/s390/pci/pci_irq.c | 8 ++++----
drivers/s390/cio/airq.c | 10 +++++++---
drivers/s390/virtio/virtio_ccw.c | 2 +-
4 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/airq.h b/arch/s390/include/asm/airq.h
index 7918a7d09028..e82e5626e139 100644
--- a/arch/s390/include/asm/airq.h
+++ b/arch/s390/include/asm/airq.h
@@ -47,8 +47,10 @@ struct airq_iv {
#define AIRQ_IV_PTR 4 /* Allocate the ptr array */
#define AIRQ_IV_DATA 8 /* Allocate the data array */
#define AIRQ_IV_CACHELINE 16 /* Cacheline alignment for the vector */
+#define AIRQ_IV_GUESTVEC 32 /* Vector is a pinned guest page */

-struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags);
+struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags,
+ unsigned long *vec);
void airq_iv_release(struct airq_iv *iv);
unsigned long airq_iv_alloc(struct airq_iv *iv, unsigned long num);
void airq_iv_free(struct airq_iv *iv, unsigned long bit, unsigned long num);
diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
index cc4c8d7c8f5c..0d0a02a9fbbf 100644
--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -296,7 +296,7 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
zdev->aisb = bit;

/* Create adapter interrupt vector */
- zdev->aibv = airq_iv_create(msi_vecs, AIRQ_IV_DATA | AIRQ_IV_BITLOCK);
+ zdev->aibv = airq_iv_create(msi_vecs, AIRQ_IV_DATA | AIRQ_IV_BITLOCK, NULL);
if (!zdev->aibv)
return -ENOMEM;

@@ -419,7 +419,7 @@ static int __init zpci_directed_irq_init(void)
union zpci_sic_iib iib = {{0}};
unsigned int cpu;

- zpci_sbv = airq_iv_create(num_possible_cpus(), 0);
+ zpci_sbv = airq_iv_create(num_possible_cpus(), 0, NULL);
if (!zpci_sbv)
return -ENOMEM;

@@ -441,7 +441,7 @@ static int __init zpci_directed_irq_init(void)
zpci_ibv[cpu] = airq_iv_create(cache_line_size() * BITS_PER_BYTE,
AIRQ_IV_DATA |
AIRQ_IV_CACHELINE |
- (!cpu ? AIRQ_IV_ALLOC : 0));
+ (!cpu ? AIRQ_IV_ALLOC : 0), NULL);
if (!zpci_ibv[cpu])
return -ENOMEM;
}
@@ -458,7 +458,7 @@ static int __init zpci_floating_irq_init(void)
if (!zpci_ibv)
return -ENOMEM;

- zpci_sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC);
+ zpci_sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC, NULL);
if (!zpci_sbv)
goto out_free;

diff --git a/drivers/s390/cio/airq.c b/drivers/s390/cio/airq.c
index 2f2226786319..375a58b1c838 100644
--- a/drivers/s390/cio/airq.c
+++ b/drivers/s390/cio/airq.c
@@ -122,10 +122,12 @@ static inline unsigned long iv_size(unsigned long bits)
* airq_iv_create - create an interrupt vector
* @bits: number of bits in the interrupt vector
* @flags: allocation flags
+ * @vec: pointer to pinned guest memory if AIRQ_IV_GUESTVEC
*
* Returns a pointer to an interrupt vector structure
*/
-struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
+struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags,
+ unsigned long *vec)
{
struct airq_iv *iv;
unsigned long size;
@@ -146,6 +148,8 @@ struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
&iv->vector_dma);
if (!iv->vector)
goto out_free;
+ } else if (flags & AIRQ_IV_GUESTVEC) {
+ iv->vector = vec;
} else {
iv->vector = cio_dma_zalloc(size);
if (!iv->vector)
@@ -185,7 +189,7 @@ struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
kfree(iv->avail);
if (iv->flags & AIRQ_IV_CACHELINE && iv->vector)
dma_pool_free(airq_iv_cache, iv->vector, iv->vector_dma);
- else
+ else if (!(iv->flags & AIRQ_IV_GUESTVEC))
cio_dma_free(iv->vector, size);
kfree(iv);
out:
@@ -204,7 +208,7 @@ void airq_iv_release(struct airq_iv *iv)
kfree(iv->bitlock);
if (iv->flags & AIRQ_IV_CACHELINE)
dma_pool_free(airq_iv_cache, iv->vector, iv->vector_dma);
- else
+ else if (!(iv->flags & AIRQ_IV_GUESTVEC))
cio_dma_free(iv->vector, iv_size(iv->bits));
kfree(iv->avail);
kfree(iv);
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 52c376d15978..410498d693f8 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -241,7 +241,7 @@ static struct airq_info *new_airq_info(int index)
return NULL;
rwlock_init(&info->lock);
info->aiv = airq_iv_create(VIRTIO_IV_BITS, AIRQ_IV_ALLOC | AIRQ_IV_PTR
- | AIRQ_IV_CACHELINE);
+ | AIRQ_IV_CACHELINE, NULL);
if (!info->aiv) {
kfree(info);
return NULL;
--
2.27.0


2022-02-07 11:29:03

by Matthew Rosato

Subject: [PATCH v3 12/30] s390/pci: get SHM information from list pci

KVM will need information on the special handle mask used to indicate
emulated devices. In order to obtain this, a new type of list pci call
must be made to gather the information. Extend clp_list_pci_req to
also fetch the model-dependent-data field that holds this mask.
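
A sketch of the intended consumer (the real user arrives later in the
series; the helper below is illustrative only):

/* Sketch only; not part of this patch. */
static bool fh_is_emulated(u32 fh)
{
        u32 mdd = 0;

        if (zpci_get_mdd(&mdd) || !mdd)
                return false;

        /* A handle with any SHM-mask bit set denotes an emulated device. */
        return (fh & mdd) != 0;
}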

Reviewed-by: Niklas Schnelle <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/pci.h | 1 +
arch/s390/include/asm/pci_clp.h | 2 +-
arch/s390/pci/pci_clp.c | 25 ++++++++++++++++++++++---
3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 3c0b9986dcdc..e8a3fd5bc169 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -227,6 +227,7 @@ int clp_enable_fh(struct zpci_dev *zdev, u32 *fh, u8 nr_dma_as);
int clp_disable_fh(struct zpci_dev *zdev, u32 *fh);
int clp_get_state(u32 fid, enum zpci_state *state);
int clp_refresh_fh(u32 fid, u32 *fh);
+int zpci_get_mdd(u32 *mdd);

/* UID */
void update_uid_checking(bool new);
diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
index d6189ed14f84..dc2041e97de4 100644
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -76,7 +76,7 @@ struct clp_req_list_pci {
struct clp_rsp_list_pci {
struct clp_rsp_hdr hdr;
u64 resume_token;
- u32 reserved2;
+ u32 mdd;
u16 max_fn;
u8 : 7;
u8 uid_checking : 1;
diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index dc733b58e74f..7477956be632 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -328,7 +328,7 @@ int clp_disable_fh(struct zpci_dev *zdev, u32 *fh)
}

static int clp_list_pci_req(struct clp_req_rsp_list_pci *rrb,
- u64 *resume_token, int *nentries)
+ u64 *resume_token, int *nentries, u32 *mdd)
{
int rc;

@@ -354,6 +354,8 @@ static int clp_list_pci_req(struct clp_req_rsp_list_pci *rrb,
*nentries = (rrb->response.hdr.len - LIST_PCI_HDR_LEN) /
rrb->response.entry_size;
*resume_token = rrb->response.resume_token;
+ if (mdd)
+ *mdd = rrb->response.mdd;

return rc;
}
@@ -365,7 +367,7 @@ static int clp_list_pci(struct clp_req_rsp_list_pci *rrb, void *data,
int nentries, i, rc;

do {
- rc = clp_list_pci_req(rrb, &resume_token, &nentries);
+ rc = clp_list_pci_req(rrb, &resume_token, &nentries, NULL);
if (rc)
return rc;
for (i = 0; i < nentries; i++)
@@ -383,7 +385,7 @@ static int clp_find_pci(struct clp_req_rsp_list_pci *rrb, u32 fid,
int nentries, i, rc;

do {
- rc = clp_list_pci_req(rrb, &resume_token, &nentries);
+ rc = clp_list_pci_req(rrb, &resume_token, &nentries, NULL);
if (rc)
return rc;
fh_list = rrb->response.fh_list;
@@ -468,6 +470,23 @@ int clp_get_state(u32 fid, enum zpci_state *state)
return rc;
}

+int zpci_get_mdd(u32 *mdd)
+{
+ struct clp_req_rsp_list_pci *rrb;
+ u64 resume_token = 0;
+ int nentries, rc;
+
+ rrb = clp_alloc_block(GFP_KERNEL);
+ if (!rrb)
+ return -ENOMEM;
+
+ rc = clp_list_pci_req(rrb, &resume_token, &nentries, mdd);
+
+ clp_free_block(rrb);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(zpci_get_mdd);
+
static int clp_base_slpc(struct clp_req *req, struct clp_req_rsp_slpc *lpcb)
{
unsigned long limit = PAGE_SIZE - sizeof(lpcb->request);
--
2.27.0


2022-02-07 11:29:11

by Matthew Rosato

Subject: [PATCH v3 16/30] KVM: s390: pci: do initial setup for AEN interpretation

Initial setup for Adapter Event Notification Interpretation for zPCI
passthrough devices. Specifically, allocate a structure for forwarding of
adapter events and pass the address of this structure to firmware.

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/pci.h | 4 +
arch/s390/include/asm/pci_insn.h | 12 +++
arch/s390/kvm/interrupt.c | 14 +++
arch/s390/kvm/kvm-s390.c | 9 ++
arch/s390/kvm/pci.c | 154 +++++++++++++++++++++++++++++++
arch/s390/kvm/pci.h | 42 +++++++++
arch/s390/pci/pci.c | 6 ++
7 files changed, 241 insertions(+)
create mode 100644 arch/s390/kvm/pci.h

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 4faff673078b..1ae49330d1c8 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -9,6 +9,7 @@
#include <asm-generic/pci.h>
#include <asm/pci_clp.h>
#include <asm/pci_debug.h>
+#include <asm/pci_insn.h>
#include <asm/sclp.h>

#define PCIBIOS_MIN_IO 0x1000
@@ -204,6 +205,9 @@ extern const struct attribute_group *zpci_attr_groups[];
extern unsigned int s390_pci_force_floating __initdata;
extern unsigned int s390_pci_no_rid;

+extern union zpci_sic_iib *zpci_aipb;
+extern struct airq_iv *zpci_aif_sbv;
+
/* -----------------------------------------------------------------------------
Prototypes
----------------------------------------------------------------------------- */
diff --git a/arch/s390/include/asm/pci_insn.h b/arch/s390/include/asm/pci_insn.h
index 32759c407b8f..ad9000295c82 100644
--- a/arch/s390/include/asm/pci_insn.h
+++ b/arch/s390/include/asm/pci_insn.h
@@ -101,6 +101,7 @@ struct zpci_fib {
/* Set Interruption Controls Operation Controls */
#define SIC_IRQ_MODE_ALL 0
#define SIC_IRQ_MODE_SINGLE 1
+#define SIC_SET_AENI_CONTROLS 2
#define SIC_IRQ_MODE_DIRECT 4
#define SIC_IRQ_MODE_D_ALL 16
#define SIC_IRQ_MODE_D_SINGLE 17
@@ -127,9 +128,20 @@ struct zpci_cdiib {
u64 : 64;
} __packed __aligned(8);

+/* adapter interruption parameters block */
+struct zpci_aipb {
+ u64 faisb;
+ u64 gait;
+ u16 : 13;
+ u16 afi : 3;
+ u32 : 32;
+ u16 faal;
+} __packed __aligned(8);
+
union zpci_sic_iib {
struct zpci_diib diib;
struct zpci_cdiib cdiib;
+ struct zpci_aipb aipb;
};

DECLARE_STATIC_KEY_FALSE(have_mio);
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 65e75ca2fc5d..5e638f7c86f8 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -32,6 +32,7 @@
#include "kvm-s390.h"
#include "gaccess.h"
#include "trace-s390.h"
+#include "pci.h"

#define PFAULT_INIT 0x0600
#define PFAULT_DONE 0x0680
@@ -3286,6 +3287,11 @@ void kvm_s390_gib_destroy(void)
{
if (!gib)
return;
+ if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV) && sclp.has_aeni && aift) {
+ mutex_lock(&aift->aift_lock);
+ kvm_s390_pci_aen_exit();
+ mutex_unlock(&aift->aift_lock);
+ }
chsc_sgib(0);
unregister_adapter_interrupt(&gib_alert_irq);
free_page((unsigned long)gib);
@@ -3323,6 +3329,14 @@ int kvm_s390_gib_init(u8 nisc)
goto out_unreg_gal;
}

+ if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV) && sclp.has_aeni) {
+ if (kvm_s390_pci_aen_init(nisc)) {
+ pr_err("Initializing AEN for PCI failed\n");
+ rc = -EIO;
+ goto out_unreg_gal;
+ }
+ }
+
KVM_EVENT(3, "gib 0x%pK (nisc=%d) initialized", gib, gib->nisc);
goto out;

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 577f1ead6a51..dd4f4bfb326b 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -48,6 +48,7 @@
#include <asm/fpu/api.h>
#include "kvm-s390.h"
#include "gaccess.h"
+#include "pci.h"

#define CREATE_TRACE_POINTS
#include "trace.h"
@@ -503,6 +504,14 @@ int kvm_arch_init(void *opaque)
goto out;
}

+ if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV)) {
+ rc = kvm_s390_pci_init();
+ if (rc) {
+ pr_err("Unable to allocate AIFT for PCI\n");
+ goto out;
+ }
+ }
+
rc = kvm_s390_gib_init(GAL_ISC);
if (rc)
goto out;
diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
index bc1b7a564e55..9b8390133e15 100644
--- a/arch/s390/kvm/pci.c
+++ b/arch/s390/kvm/pci.c
@@ -10,6 +10,148 @@
#include <linux/kvm_host.h>
#include <linux/pci.h>
#include <asm/kvm_pci.h>
+#include <asm/pci.h>
+#include <asm/pci_insn.h>
+#include "pci.h"
+
+struct zpci_aift *aift;
+
+static inline int __set_irq_noiib(u16 ctl, u8 isc)
+{
+ union zpci_sic_iib iib = {{0}};
+
+ return zpci_set_irq_ctrl(ctl, isc, &iib);
+}
+
+/* Caller must hold the aift lock before calling this function */
+void kvm_s390_pci_aen_exit(void)
+{
+ unsigned long flags;
+ struct kvm_zdev **gait_kzdev;
+
+ /*
+ * Contents of the aipb remain registered for the life of the host
+ * kernel, the information preserved in zpci_aipb and zpci_aif_sbv
+ * in case we insert the KVM module again later. Clear the AIFT
+ * information and free anything not registered with underlying
+ * firmware.
+ */
+ spin_lock_irqsave(&aift->gait_lock, flags);
+ gait_kzdev = aift->kzdev;
+ aift->gait = 0;
+ aift->sbv = 0;
+ aift->kzdev = 0;
+ spin_unlock_irqrestore(&aift->gait_lock, flags);
+
+ kfree(gait_kzdev);
+}
+
+static int zpci_setup_aipb(u8 nisc)
+{
+ struct page *page;
+ int size, rc;
+
+ zpci_aipb = kzalloc(sizeof(union zpci_sic_iib), GFP_KERNEL);
+ if (!zpci_aipb)
+ return -ENOMEM;
+
+ aift->sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC, 0);
+ if (!aift->sbv) {
+ rc = -ENOMEM;
+ goto free_aipb;
+ }
+ zpci_aif_sbv = aift->sbv;
+ size = get_order(PAGE_ALIGN(ZPCI_NR_DEVICES *
+ sizeof(struct zpci_gaite)));
+ page = alloc_pages(GFP_KERNEL | __GFP_ZERO, size);
+ if (!page) {
+ rc = -ENOMEM;
+ goto free_sbv;
+ }
+ aift->gait = (struct zpci_gaite *)page_to_phys(page);
+
+ zpci_aipb->aipb.faisb = virt_to_phys(aift->sbv->vector);
+ zpci_aipb->aipb.gait = virt_to_phys(aift->gait);
+ zpci_aipb->aipb.afi = nisc;
+ zpci_aipb->aipb.faal = ZPCI_NR_DEVICES;
+
+ /* Setup Adapter Event Notification Interpretation */
+ if (zpci_set_irq_ctrl(SIC_SET_AENI_CONTROLS, 0, zpci_aipb)) {
+ rc = -EIO;
+ goto free_gait;
+ }
+
+ return 0;
+
+free_gait:
+ size = get_order(PAGE_ALIGN(ZPCI_NR_DEVICES *
+ sizeof(struct zpci_gaite)));
+ free_pages((unsigned long)aift->gait, size);
+free_sbv:
+ airq_iv_release(aift->sbv);
+ zpci_aif_sbv = 0;
+free_aipb:
+ kfree(zpci_aipb);
+ zpci_aipb = 0;
+
+ return rc;
+}
+
+static int zpci_reset_aipb(u8 nisc)
+{
+ /*
+ * AEN registration can only happen once per system boot. If
+ * an aipb already exists then AEN was already registered and
+ * we can re-use the aipb contents. This can only happen if
+ * the KVM module was removed and re-inserted.
+ */
+ if (zpci_aipb->aipb.faal != ZPCI_NR_DEVICES ||
+ zpci_aipb->aipb.afi != nisc) {
+ return -EINVAL;
+ }
+ aift->sbv = zpci_aif_sbv;
+ aift->gait = (struct zpci_gaite *)zpci_aipb->aipb.gait;
+
+ return 0;
+}
+
+int kvm_s390_pci_aen_init(u8 nisc)
+{
+ int rc = 0;
+
+ /* If already enabled for AEN, bail out now */
+ if (aift->gait || aift->sbv)
+ return -EPERM;
+
+ mutex_lock(&aift->aift_lock);
+ aift->kzdev = kcalloc(ZPCI_NR_DEVICES, sizeof(struct kvm_zdev),
+ GFP_KERNEL);
+ if (!aift->kzdev) {
+ rc = -ENOMEM;
+ goto unlock;
+ }
+
+ if (!zpci_aipb)
+ rc = zpci_setup_aipb(nisc);
+ else
+ rc = zpci_reset_aipb(nisc);
+ if (rc)
+ goto free_zdev;
+
+ /* Enable floating IRQs */
+ if (__set_irq_noiib(SIC_IRQ_MODE_SINGLE, nisc)) {
+ rc = -EIO;
+ kvm_s390_pci_aen_exit();
+ }
+
+ goto unlock;
+
+free_zdev:
+ kfree(aift->kzdev);
+unlock:
+ mutex_unlock(&aift->aift_lock);
+ return rc;
+}

int kvm_s390_pci_dev_open(struct zpci_dev *zdev)
{
@@ -36,3 +178,15 @@ void kvm_s390_pci_dev_release(struct zpci_dev *zdev)
kfree(kzdev);
}
EXPORT_SYMBOL_GPL(kvm_s390_pci_dev_release);
+
+int kvm_s390_pci_init(void)
+{
+ aift = kzalloc(sizeof(struct zpci_aift), GFP_KERNEL);
+ if (!aift)
+ return -ENOMEM;
+
+ spin_lock_init(&aift->gait_lock);
+ mutex_init(&aift->aift_lock);
+
+ return 0;
+}
diff --git a/arch/s390/kvm/pci.h b/arch/s390/kvm/pci.h
new file mode 100644
index 000000000000..53e9968707c8
--- /dev/null
+++ b/arch/s390/kvm/pci.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * s390 kvm PCI passthrough support
+ *
+ * Copyright IBM Corp. 2021
+ *
+ * Author(s): Matthew Rosato <[email protected]>
+ */
+
+#ifndef __KVM_S390_PCI_H
+#define __KVM_S390_PCI_H
+
+#include <linux/pci.h>
+#include <linux/mutex.h>
+#include <asm/airq.h>
+#include <asm/kvm_pci.h>
+
+struct zpci_gaite {
+ u32 gisa;
+ u8 gisc;
+ u8 count;
+ u8 reserved;
+ u8 aisbo;
+ u64 aisb;
+};
+
+struct zpci_aift {
+ struct zpci_gaite *gait;
+ struct airq_iv *sbv;
+ struct kvm_zdev **kzdev;
+ spinlock_t gait_lock; /* Protects the gait, used during AEN forward */
+ struct mutex aift_lock; /* Protects the other structures in aift */
+};
+
+extern struct zpci_aift *aift;
+
+int kvm_s390_pci_aen_init(u8 nisc);
+void kvm_s390_pci_aen_exit(void);
+
+int kvm_s390_pci_init(void);
+
+#endif /* __KVM_S390_PCI_H */
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 04c16312ad54..13033717cd4e 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -61,6 +61,12 @@ DEFINE_STATIC_KEY_FALSE(have_mio);

static struct kmem_cache *zdev_fmb_cache;

+/* AEN structures that must be preserved over KVM module re-insertion */
+union zpci_sic_iib *zpci_aipb;
+EXPORT_SYMBOL_GPL(zpci_aipb);
+struct airq_iv *zpci_aif_sbv;
+EXPORT_SYMBOL_GPL(zpci_aif_sbv);
+
struct zpci_dev *get_zdev_by_fid(u32 fid)
{
struct zpci_dev *tmp, *zdev = NULL;
--
2.27.0


2022-02-07 11:30:29

by Matthew Rosato

Subject: [PATCH v3 23/30] KVM: s390: intercept the rpcit instruction

For faster handling of PCI translation refreshes, intercept the RPCIT
instruction in KVM and call the associated handler.

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/kvm/priv.c | 49 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 417154b314a6..d6a5784d8dde 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -29,6 +29,7 @@
#include <asm/ap.h>
#include "gaccess.h"
#include "kvm-s390.h"
+#include "pci.h"
#include "trace.h"

static int handle_ri(struct kvm_vcpu *vcpu)
@@ -335,6 +336,52 @@ static int handle_rrbe(struct kvm_vcpu *vcpu)
return 0;
}

+static int handle_rpcit(struct kvm_vcpu *vcpu)
+{
+ int reg1, reg2;
+ u8 status;
+ int rc;
+
+ if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
+ return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
+
+ /*
+ * If the host doesn't support PCI passthrough interpretation, it must
+ * be an emulated device.
+ */
+ if (!IS_ENABLED(CONFIG_VFIO_PCI_ZDEV))
+ return -EOPNOTSUPP;
+
+ kvm_s390_get_regs_rre(vcpu, &reg1, &reg2);
+
+ /* If the device has a SHM bit on, let userspace take care of this */
+ if (((vcpu->run->s.regs.gprs[reg1] >> 32) & aift->mdd) != 0)
+ return -EOPNOTSUPP;
+
+ rc = kvm_s390_pci_refresh_trans(vcpu, vcpu->run->s.regs.gprs[reg1],
+ vcpu->run->s.regs.gprs[reg2],
+ vcpu->run->s.regs.gprs[reg2 + 1],
+ &status);
+
+ switch (rc) {
+ case 0:
+ kvm_s390_set_psw_cc(vcpu, 0);
+ break;
+ case -EOPNOTSUPP:
+ return -EOPNOTSUPP;
+ default:
+ vcpu->run->s.regs.gprs[reg1] &= 0xffffffff00ffffffUL;
+ vcpu->run->s.regs.gprs[reg1] |= (u64)status << 24;
+ if (status != 0)
+ kvm_s390_set_psw_cc(vcpu, 1);
+ else
+ kvm_s390_set_psw_cc(vcpu, 3);
+ break;
+ }
+
+ return 0;
+}
+
#define SSKE_NQ 0x8
#define SSKE_MR 0x4
#define SSKE_MC 0x2
@@ -1275,6 +1322,8 @@ int kvm_s390_handle_b9(struct kvm_vcpu *vcpu)
return handle_essa(vcpu);
case 0xaf:
return handle_pfmf(vcpu);
+ case 0xd3:
+ return handle_rpcit(vcpu);
default:
return -EOPNOTSUPP;
}
--
2.27.0


2022-02-07 11:38:04

by Matthew Rosato

Subject: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

KVM zPCI passthrough device logic will need a reference to the associated
kvm guest that has access to the device. Let's register a group notifier
for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
an association between a kvm guest and the host zdev.

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/kvm_pci.h | 2 ++
drivers/vfio/pci/vfio_pci_core.c | 2 ++
drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
include/linux/vfio_pci_core.h | 10 +++++++
4 files changed, 60 insertions(+)

diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
index e4696f5592e1..16290b4cf2a6 100644
--- a/arch/s390/include/asm/kvm_pci.h
+++ b/arch/s390/include/asm/kvm_pci.h
@@ -16,6 +16,7 @@
#include <linux/kvm.h>
#include <linux/pci.h>
#include <linux/mutex.h>
+#include <linux/notifier.h>
#include <asm/pci_insn.h>
#include <asm/pci_dma.h>

@@ -32,6 +33,7 @@ struct kvm_zdev {
u64 rpcit_count;
struct kvm_zdev_ioat ioat;
struct zpci_fib fib;
+ struct notifier_block nb;
};

int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f948e6cd2993..fc57d4d0abbe 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)

vfio_pci_vf_token_user_add(vdev, -1);
vfio_spapr_pci_eeh_release(vdev->pdev);
+ vfio_pci_zdev_release(vdev);
vfio_pci_core_disable(vdev);

mutex_lock(&vdev->igate);
@@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
{
vfio_pci_probe_mmaps(vdev);
+ vfio_pci_zdev_open(vdev);
vfio_spapr_pci_eeh_open(vdev->pdev);
vfio_pci_vf_token_user_add(vdev, 1);
}
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index ea4c0d2b0663..9f8284499111 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -13,6 +13,7 @@
#include <linux/vfio_zdev.h>
#include <asm/pci_clp.h>
#include <asm/pci_io.h>
+#include <asm/kvm_pci.h>

#include <linux/vfio_pci_core.h>

@@ -136,3 +137,48 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,

return ret;
}
+
+static int vfio_pci_zdev_group_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct kvm_zdev *kzdev = container_of(nb, struct kvm_zdev, nb);
+
+ if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
+ if (!data || !kzdev->zdev)
+ return NOTIFY_DONE;
+ kzdev->kvm = data;
+ }
+
+ return NOTIFY_OK;
+}
+
+void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
+{
+ unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
+ struct zpci_dev *zdev = to_zpci(vdev->pdev);
+
+ if (!zdev)
+ return;
+
+ if (kvm_s390_pci_dev_open(zdev))
+ return;
+
+ zdev->kzdev->nb.notifier_call = vfio_pci_zdev_group_notifier;
+
+ if (vfio_register_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
+ &events, &zdev->kzdev->nb))
+ kvm_s390_pci_dev_release(zdev);
+}
+
+void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
+{
+ struct zpci_dev *zdev = to_zpci(vdev->pdev);
+
+ if (!zdev || !zdev->kzdev)
+ return;
+
+ vfio_unregister_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
+ &zdev->kzdev->nb);
+
+ kvm_s390_pci_dev_release(zdev);
+}
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 5e2bca3b89db..05287f8ac855 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -198,12 +198,22 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
#ifdef CONFIG_VFIO_PCI_ZDEV
extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
struct vfio_info_cap *caps);
+void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev);
+void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev);
#else
static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
struct vfio_info_cap *caps)
{
return -ENODEV;
}
+
+static inline void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
+{
+}
+
+static inline void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
+{
+}
#endif

/* Will be exported for vfio pci drivers usage */
--
2.27.0


2022-02-07 12:05:25

by Matthew Rosato

Subject: [PATCH v3 29/30] KVM: s390: introduce CPU feature for zPCI Interpretation

KVM_S390_VM_CPU_FEAT_ZPCI_INTERP relays whether zPCI interpretive
execution is possible based on the available hardware facilities.
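
For orientation, a userspace sketch of how the feature could be queried
(not taken from this series; the byte-wise test assumes the MSB-first bit
numbering that the kernel's set_bit_inv() gives the feat bitmap):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch only; vm_fd is an open KVM VM file descriptor. */
static int vm_has_zpci_interp(int vm_fd)
{
        struct kvm_s390_vm_cpu_feat feat = {};
        struct kvm_device_attr attr = {
                .group = KVM_S390_VM_CPU_MODEL,
                .attr = KVM_S390_VM_CPU_MACHINE_FEAT,
                .addr = (__u64)(unsigned long)&feat,
        };
        const unsigned char *bytes = (const unsigned char *)feat.feat;

        if (ioctl(vm_fd, KVM_GET_DEVICE_ATTR, &attr))
                return 0;

        return bytes[KVM_S390_VM_CPU_FEAT_ZPCI_INTERP / 8] &
               (0x80 >> (KVM_S390_VM_CPU_FEAT_ZPCI_INTERP % 8));
}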

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/uapi/asm/kvm.h | 1 +
arch/s390/kvm/kvm-s390.c | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
index 7a6b14874d65..ed06458a871f 100644
--- a/arch/s390/include/uapi/asm/kvm.h
+++ b/arch/s390/include/uapi/asm/kvm.h
@@ -130,6 +130,7 @@ struct kvm_s390_vm_cpu_machine {
#define KVM_S390_VM_CPU_FEAT_PFMFI 11
#define KVM_S390_VM_CPU_FEAT_SIGPIF 12
#define KVM_S390_VM_CPU_FEAT_KSS 13
+#define KVM_S390_VM_CPU_FEAT_ZPCI_INTERP 14
struct kvm_s390_vm_cpu_feat {
__u64 feat[16];
};
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 208b09d08385..e20d9ac1935d 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -434,6 +434,11 @@ static void kvm_s390_cpu_feat_init(void)
if (test_facility(151)) /* DFLTCC */
__insn32_query(INSN_DFLTCC, kvm_s390_available_subfunc.dfltcc);

+ /* zPCI Interpretation */
+ if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV) && test_facility(69) &&
+ test_facility(70) && test_facility(71) && test_facility(72))
+ allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ZPCI_INTERP);
+
if (MACHINE_HAS_ESOP)
allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ESOP);
/*
--
2.27.0


2022-02-07 15:08:57

by Matthew Rosato

Subject: [PATCH v3 08/30] s390/pci: stash associated GISA designation

For passthrough devices, we will need to know the GISA designation of the
guest if interpretation facilities are to be used. Set up to stash this in
the zdev and set a default of 0 (no GISA designation) for now; a subsequent
patch will set a valid GISA designation for passthrough devices.
Also, extend mpcific routines to specify this stashed designation as part
of the mpcific command.

Reviewed-by: Niklas Schnelle <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: Eric Farman <[email protected]>
Reviewed-by: Pierre Morel <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/pci.h | 1 +
arch/s390/include/asm/pci_clp.h | 3 ++-
arch/s390/pci/pci.c | 6 ++++++
arch/s390/pci/pci_clp.c | 1 +
arch/s390/pci/pci_irq.c | 5 +++++
5 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index 90824be5ce9a..d07d7c3205de 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -123,6 +123,7 @@ struct zpci_dev {
enum zpci_state state;
u32 fid; /* function ID, used by sclp */
u32 fh; /* function handle, used by insn's */
+ u32 gisa; /* GISA designation for passthrough */
u16 vfn; /* virtual function number */
u16 pchid; /* physical channel ID */
u8 pfgid; /* function group ID */
diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
index 1f4b666e85ee..f3286bc5ba6e 100644
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -173,7 +173,8 @@ struct clp_req_set_pci {
u16 reserved2;
u8 oc; /* operation controls */
u8 ndas; /* number of dma spaces */
- u64 reserved3;
+ u32 reserved3;
+ u32 gisa; /* GISA designation */
} __packed;

/* Set PCI function response */
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 792f8e0f2178..ca9c29386de6 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -119,6 +119,7 @@ int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
fib.pba = base;
fib.pal = limit;
fib.iota = iota | ZPCI_IOTA_RTTO_FLAG;
+ fib.gd = zdev->gisa;
cc = zpci_mod_fc(req, &fib, &status);
if (cc)
zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
@@ -132,6 +133,8 @@ int zpci_unregister_ioat(struct zpci_dev *zdev, u8 dmaas)
struct zpci_fib fib = {0};
u8 cc, status;

+ fib.gd = zdev->gisa;
+
cc = zpci_mod_fc(req, &fib, &status);
if (cc)
zpci_dbg(3, "unreg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
@@ -159,6 +162,7 @@ int zpci_fmb_enable_device(struct zpci_dev *zdev)
atomic64_set(&zdev->unmapped_pages, 0);

fib.fmb_addr = virt_to_phys(zdev->fmb);
+ fib.gd = zdev->gisa;
cc = zpci_mod_fc(req, &fib, &status);
if (cc) {
kmem_cache_free(zdev_fmb_cache, zdev->fmb);
@@ -177,6 +181,8 @@ int zpci_fmb_disable_device(struct zpci_dev *zdev)
if (!zdev->fmb)
return -EINVAL;

+ fib.gd = zdev->gisa;
+
/* Function measurement is disabled if fmb address is zero */
cc = zpci_mod_fc(req, &fib, &status);
if (cc == 3) /* Function already gone. */
diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
index be077b39da33..4dcc37ddeeaf 100644
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -240,6 +240,7 @@ static int clp_set_pci_fn(struct zpci_dev *zdev, u32 *fh, u8 nr_dma_as, u8 comma
rrb->request.fh = zdev->fh;
rrb->request.oc = command;
rrb->request.ndas = nr_dma_as;
+ rrb->request.gisa = zdev->gisa;

rc = clp_req(rrb, CLP_LPS_PCI);
if (rrb->response.hdr.rsp == CLP_RC_SETPCIFN_BUSY) {
diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
index 2f675355fd0c..a19ac0282929 100644
--- a/arch/s390/pci/pci_irq.c
+++ b/arch/s390/pci/pci_irq.c
@@ -43,6 +43,7 @@ static int zpci_set_airq(struct zpci_dev *zdev)
fib.fmt0.aibvo = 0; /* each zdev has its own interrupt vector */
fib.fmt0.aisb = virt_to_phys(zpci_sbv->vector) + (zdev->aisb / 64) * 8;
fib.fmt0.aisbo = zdev->aisb & 63;
+ fib.gd = zdev->gisa;

return zpci_mod_fc(req, &fib, &status) ? -EIO : 0;
}
@@ -54,6 +55,8 @@ static int zpci_clear_airq(struct zpci_dev *zdev)
struct zpci_fib fib = {0};
u8 cc, status;

+ fib.gd = zdev->gisa;
+
cc = zpci_mod_fc(req, &fib, &status);
if (cc == 3 || (cc == 1 && status == 24))
/* Function already gone or IRQs already deregistered. */
@@ -72,6 +75,7 @@ static int zpci_set_directed_irq(struct zpci_dev *zdev)
fib.fmt = 1;
fib.fmt1.noi = zdev->msi_nr_irqs;
fib.fmt1.dibvo = zdev->msi_first_bit;
+ fib.gd = zdev->gisa;

return zpci_mod_fc(req, &fib, &status) ? -EIO : 0;
}
@@ -84,6 +88,7 @@ static int zpci_clear_directed_irq(struct zpci_dev *zdev)
u8 cc, status;

fib.fmt = 1;
+ fib.gd = zdev->gisa;
cc = zpci_mod_fc(req, &fib, &status);
if (cc == 3 || (cc == 1 && status == 24))
/* Function already gone or IRQs already deregistered. */
--
2.27.0


2022-02-07 17:29:01

by Matthew Rosato

Subject: [PATCH v3 19/30] KVM: s390: pci: provide routines for enabling/disabling interpretation

These routines will be wired into the vfio_pci_zdev ioctl handlers to
respond to requests to enable / disable a device for zPCI Load/Store
interpretation.

The first time such a request is received, enable the necessary facilities
for the guest.
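
As a rough sketch of how the (later) vfio-pci-zdev feature handlers are
expected to drive these routines (the wrapper name and the enable flag
are assumptions for illustration only):

/* Illustrative wrapper; the real wiring lands later in the series. */
static int zdev_feat_interp(struct zpci_dev *zdev, bool enable)
{
        int rc;

        /* Validate facilities and the KVM association first. */
        rc = kvm_s390_pci_interp_probe(zdev);
        if (rc)
                return rc;

        if (enable)
                return kvm_s390_pci_interp_enable(zdev);

        return kvm_s390_pci_interp_disable(zdev);
}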

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/kvm_pci.h | 4 ++
arch/s390/kvm/pci.c | 102 ++++++++++++++++++++++++++++++++
arch/s390/pci/pci.c | 3 +
3 files changed, 109 insertions(+)

diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
index ef10f9e46e37..422701d526dd 100644
--- a/arch/s390/include/asm/kvm_pci.h
+++ b/arch/s390/include/asm/kvm_pci.h
@@ -24,4 +24,8 @@ struct kvm_zdev {
int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
void kvm_s390_pci_dev_release(struct zpci_dev *zdev);

+int kvm_s390_pci_interp_probe(struct zpci_dev *zdev);
+int kvm_s390_pci_interp_enable(struct zpci_dev *zdev);
+int kvm_s390_pci_interp_disable(struct zpci_dev *zdev);
+
#endif /* ASM_KVM_PCI_H */
diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
index 9b8390133e15..16bef3935284 100644
--- a/arch/s390/kvm/pci.c
+++ b/arch/s390/kvm/pci.c
@@ -12,7 +12,9 @@
#include <asm/kvm_pci.h>
#include <asm/pci.h>
#include <asm/pci_insn.h>
+#include <asm/sclp.h>
#include "pci.h"
+#include "kvm-s390.h"

struct zpci_aift *aift;

@@ -153,6 +155,106 @@ int kvm_s390_pci_aen_init(u8 nisc)
return rc;
}

+int kvm_s390_pci_interp_probe(struct zpci_dev *zdev)
+{
+ /* Must have appropriate hardware facilities */
+ if (!sclp.has_zpci_lsi || !sclp.has_aisii || !sclp.has_aeni ||
+ !sclp.has_aisi || !test_facility(69) || !test_facility(70) ||
+ !test_facility(71) || !test_facility(72)) {
+ return -EINVAL;
+ }
+
+ /* Must have a KVM association registered */
+ if (!zdev->kzdev || !zdev->kzdev->kvm)
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_probe);
+
+int kvm_s390_pci_interp_enable(struct zpci_dev *zdev)
+{
+ u32 gisa;
+ int rc;
+
+ if (!zdev->kzdev || !zdev->kzdev->kvm)
+ return -EINVAL;
+
+ /*
+ * If this is the first request to use an interpreted device, make the
+ * necessary vcpu changes
+ */
+ if (!zdev->kzdev->kvm->arch.use_zpci_interp)
+ kvm_s390_vcpu_pci_enable_interp(zdev->kzdev->kvm);
+
+ /*
+ * In the event of a system reset in userspace, the GISA designation
+ * may still be assigned because the device is still enabled.
+ * Verify it's the same guest before proceeding.
+ */
+ gisa = (u32)virt_to_phys(&zdev->kzdev->kvm->arch.sie_page2->gisa);
+ if (zdev->gisa != 0 && zdev->gisa != gisa)
+ return -EPERM;
+
+ if (zdev_enabled(zdev)) {
+ zdev->gisa = 0;
+ rc = zpci_disable_device(zdev);
+ if (rc)
+ return rc;
+ }
+
+ /*
+ * Store information about the identity of the kvm guest allowed to
+ * access this device via interpretation to be used by host CLP
+ */
+ zdev->gisa = gisa;
+
+ rc = zpci_enable_device(zdev);
+ if (rc)
+ goto err;
+
+ /* Re-register the IOMMU that was already created */
+ rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
+ virt_to_phys(zdev->dma_table));
+ if (rc)
+ goto err;
+
+ return rc;
+
+err:
+ zdev->gisa = 0;
+ return rc;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_enable);
+
+int kvm_s390_pci_interp_disable(struct zpci_dev *zdev)
+{
+ int rc;
+
+ if (zdev->gisa == 0)
+ return -EINVAL;
+
+ /* Remove the host CLP guest designation */
+ zdev->gisa = 0;
+
+ if (zdev_enabled(zdev)) {
+ rc = zpci_disable_device(zdev);
+ if (rc)
+ return rc;
+ }
+
+ rc = zpci_enable_device(zdev);
+ if (rc)
+ return rc;
+
+ /* Re-register the IOMMU that was already created */
+ rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
+ virt_to_phys(zdev->dma_table));
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_disable);
+
int kvm_s390_pci_dev_open(struct zpci_dev *zdev)
{
struct kvm_zdev *kzdev;
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 13033717cd4e..5dbe49ec325e 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -147,6 +147,7 @@ int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
return cc;
}
+EXPORT_SYMBOL_GPL(zpci_register_ioat);

/* Modify PCI: Unregister I/O address translation parameters */
int zpci_unregister_ioat(struct zpci_dev *zdev, u8 dmaas)
@@ -727,6 +728,7 @@ int zpci_enable_device(struct zpci_dev *zdev)
zpci_update_fh(zdev, fh);
return rc;
}
+EXPORT_SYMBOL_GPL(zpci_enable_device);

int zpci_disable_device(struct zpci_dev *zdev)
{
@@ -750,6 +752,7 @@ int zpci_disable_device(struct zpci_dev *zdev)
}
return rc;
}
+EXPORT_SYMBOL_GPL(zpci_disable_device);

/**
* zpci_hot_reset_device - perform a reset of the given zPCI function
--
2.27.0


2022-02-07 18:10:56

by Matthew Rosato

Subject: [PATCH v3 01/30] s390/sclp: detect the zPCI load/store interpretation facility

Detect the zPCI Load/Store Interpretation facility.

Reviewed-by: Eric Farman <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/sclp.h | 1 +
drivers/s390/char/sclp_early.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index c68ea35de498..58a4d3d354b7 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -88,6 +88,7 @@ struct sclp_info {
unsigned char has_diag318 : 1;
unsigned char has_sipl : 1;
unsigned char has_dirq : 1;
+ unsigned char has_zpci_lsi : 1;
unsigned int ibc;
unsigned int mtid;
unsigned int mtid_cp;
diff --git a/drivers/s390/char/sclp_early.c b/drivers/s390/char/sclp_early.c
index e9943a86c361..b88dd0da1231 100644
--- a/drivers/s390/char/sclp_early.c
+++ b/drivers/s390/char/sclp_early.c
@@ -45,6 +45,7 @@ static void __init sclp_early_facilities_detect(void)
sclp.has_gisaf = !!(sccb->fac118 & 0x08);
sclp.has_hvs = !!(sccb->fac119 & 0x80);
sclp.has_kss = !!(sccb->fac98 & 0x01);
+ sclp.has_zpci_lsi = !!(sccb->fac118 & 0x01);
if (sccb->fac85 & 0x02)
S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
if (sccb->fac91 & 0x40)
--
2.27.0


2022-02-07 19:28:08

by Matthew Rosato

Subject: [PATCH v3 21/30] KVM: s390: pci: provide routines for enabling/disabling IOAT assist

These routines will be wired into the vfio_pci_zdev ioctl handlers to
respond to requests to enable / disable a device for PCI I/O Address
Translation assistance.
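
For orientation, a hedged sketch of the expected call flow from a (later)
feature handler; the wrapper name and the origin of the guest IOTA value
are assumptions for illustration only:

/* Illustrative wrapper; the real wiring lands later in the series. */
static int zdev_feat_ioat(struct zpci_dev *zdev, bool enable, u64 guest_iota)
{
        int rc;

        rc = kvm_s390_pci_ioat_probe(zdev);
        if (rc)
                return rc;

        if (enable)
                /* guest_iota: guest root-table origin plus designation flags */
                return kvm_s390_pci_ioat_enable(zdev, guest_iota);

        return kvm_s390_pci_ioat_disable(zdev);
}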

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/kvm_pci.h | 15 ++++
arch/s390/include/asm/pci_dma.h | 2 +
arch/s390/kvm/pci.c | 142 ++++++++++++++++++++++++++++++++
arch/s390/kvm/pci.h | 2 +
4 files changed, 161 insertions(+)

diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
index 377eebcba2d1..370d6ba041fe 100644
--- a/arch/s390/include/asm/kvm_pci.h
+++ b/arch/s390/include/asm/kvm_pci.h
@@ -15,11 +15,21 @@
#include <linux/kvm_host.h>
#include <linux/kvm.h>
#include <linux/pci.h>
+#include <linux/mutex.h>
#include <asm/pci_insn.h>
+#include <asm/pci_dma.h>
+
+struct kvm_zdev_ioat {
+ unsigned long *head[ZPCI_TABLE_PAGES];
+ unsigned long **seg;
+ unsigned long ***pt;
+ struct mutex ioat_lock; /* Must be held when modifying head/seg/pt */
+};

struct kvm_zdev {
struct zpci_dev *zdev;
struct kvm *kvm;
+ struct kvm_zdev_ioat ioat;
struct zpci_fib fib;
};

@@ -31,6 +41,11 @@ int kvm_s390_pci_aif_enable(struct zpci_dev *zdev, struct zpci_fib *fib,
bool assist);
int kvm_s390_pci_aif_disable(struct zpci_dev *zdev, bool force);

+int kvm_s390_pci_ioat_probe(struct zpci_dev *zdev);
+int kvm_s390_pci_ioat_enable(struct zpci_dev *zdev, u64 iota);
+int kvm_s390_pci_ioat_disable(struct zpci_dev *zdev);
+u8 kvm_s390_pci_get_dtsm(struct zpci_dev *zdev);
+
int kvm_s390_pci_interp_probe(struct zpci_dev *zdev);
int kvm_s390_pci_interp_enable(struct zpci_dev *zdev);
int kvm_s390_pci_interp_disable(struct zpci_dev *zdev, bool force);
diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h
index 91e63426bdc5..69e616d0712c 100644
--- a/arch/s390/include/asm/pci_dma.h
+++ b/arch/s390/include/asm/pci_dma.h
@@ -50,6 +50,8 @@ enum zpci_ioat_dtype {
#define ZPCI_TABLE_ALIGN ZPCI_TABLE_SIZE
#define ZPCI_TABLE_ENTRY_SIZE (sizeof(unsigned long))
#define ZPCI_TABLE_ENTRIES (ZPCI_TABLE_SIZE / ZPCI_TABLE_ENTRY_SIZE)
+#define ZPCI_TABLE_PAGES (ZPCI_TABLE_SIZE >> PAGE_SHIFT)
+#define ZPCI_TABLE_ENTRIES_PAGES (ZPCI_TABLE_ENTRIES * ZPCI_TABLE_PAGES)

#define ZPCI_TABLE_BITS 11
#define ZPCI_PT_BITS 8
diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
index 8e9044a7bce7..fa9d6ac1755b 100644
--- a/arch/s390/kvm/pci.c
+++ b/arch/s390/kvm/pci.c
@@ -13,12 +13,15 @@
#include <asm/pci.h>
#include <asm/pci_insn.h>
#include <asm/pci_io.h>
+#include <asm/pci_dma.h>
#include <asm/sclp.h>
#include "pci.h"
#include "kvm-s390.h"

struct zpci_aift *aift;

+#define shadow_ioat_init zdev->kzdev->ioat.head[0]
+
static inline int __set_irq_noiib(u16 ctl, u8 isc)
{
union zpci_sic_iib iib = {{0}};
@@ -366,6 +369,138 @@ int kvm_s390_pci_aif_disable(struct zpci_dev *zdev, bool force)
}
EXPORT_SYMBOL_GPL(kvm_s390_pci_aif_disable);

+int kvm_s390_pci_ioat_probe(struct zpci_dev *zdev)
+{
+ /*
+ * Using the IOAT assist is only valid for a device with a KVM guest
+ * registered
+ */
+ if (!zdev->kzdev->kvm)
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_ioat_probe);
+
+int kvm_s390_pci_ioat_enable(struct zpci_dev *zdev, u64 iota)
+{
+ gpa_t gpa = (gpa_t)(iota & ZPCI_RTE_ADDR_MASK);
+ struct kvm_zdev_ioat *ioat;
+ struct page *page;
+ struct kvm *kvm;
+ unsigned int idx;
+ void *iaddr;
+ int i, rc;
+
+ if (shadow_ioat_init)
+ return -EINVAL;
+
+ /* Ensure supported type specified */
+ if ((iota & ZPCI_IOTA_RTTO_FLAG) != ZPCI_IOTA_RTTO_FLAG)
+ return -EINVAL;
+
+ kvm = zdev->kzdev->kvm;
+ ioat = &zdev->kzdev->ioat;
+ mutex_lock(&ioat->ioat_lock);
+ idx = srcu_read_lock(&kvm->srcu);
+ for (i = 0; i < ZPCI_TABLE_PAGES; i++) {
+ page = gfn_to_page(kvm, gpa_to_gfn(gpa));
+ if (is_error_page(page)) {
+ srcu_read_unlock(&kvm->srcu, idx);
+ rc = -EIO;
+ goto unpin;
+ }
+ iaddr = page_to_virt(page) + (gpa & ~PAGE_MASK);
+ ioat->head[i] = (unsigned long *)iaddr;
+ gpa += PAGE_SIZE;
+ }
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ ioat->seg = kcalloc(ZPCI_TABLE_ENTRIES_PAGES, sizeof(unsigned long *),
+ GFP_KERNEL);
+ if (!ioat->seg)
+ goto unpin;
+ ioat->pt = kcalloc(ZPCI_TABLE_ENTRIES, sizeof(unsigned long **),
+ GFP_KERNEL);
+ if (!ioat->pt)
+ goto free_seg;
+
+ mutex_unlock(&ioat->ioat_lock);
+ return 0;
+
+free_seg:
+ kfree(ioat->seg);
+ rc = -ENOMEM;
+unpin:
+ for (i = 0; i < ZPCI_TABLE_PAGES; i++) {
+ kvm_release_pfn_dirty((u64)ioat->head[i] >> PAGE_SHIFT);
+ ioat->head[i] = 0;
+ }
+ mutex_unlock(&ioat->ioat_lock);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_ioat_enable);
+
+static void free_pt_entry(struct kvm_zdev_ioat *ioat, int st, int pt)
+{
+ if (!ioat->pt[st][pt])
+ return;
+
+ kvm_release_pfn_dirty((u64)ioat->pt[st][pt]);
+}
+
+static void free_seg_entry(struct kvm_zdev_ioat *ioat, int entry)
+{
+ int i, st, count = 0;
+
+ for (i = 0; i < ZPCI_TABLE_PAGES; i++) {
+ if (ioat->seg[entry + i]) {
+ kvm_release_pfn_dirty((u64)ioat->seg[entry + i]);
+ count++;
+ }
+ }
+
+ if (count == 0)
+ return;
+
+ st = entry / ZPCI_TABLE_PAGES;
+ for (i = 0; i < ZPCI_TABLE_ENTRIES; i++)
+ free_pt_entry(ioat, st, i);
+ kfree(ioat->pt[st]);
+}
+
+int kvm_s390_pci_ioat_disable(struct zpci_dev *zdev)
+{
+ struct kvm_zdev_ioat *ioat;
+ int i;
+
+ if (!shadow_ioat_init)
+ return -EINVAL;
+
+ ioat = &zdev->kzdev->ioat;
+ mutex_lock(&ioat->ioat_lock);
+ for (i = 0; i < ZPCI_TABLE_PAGES; i++) {
+ kvm_release_pfn_dirty((u64)ioat->head[i] >> PAGE_SHIFT);
+ ioat->head[i] = 0;
+ }
+
+ for (i = 0; i < ZPCI_TABLE_ENTRIES_PAGES; i += ZPCI_TABLE_PAGES)
+ free_seg_entry(ioat, i);
+
+ kfree(ioat->seg);
+ kfree(ioat->pt);
+ mutex_unlock(&ioat->ioat_lock);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_ioat_disable);
+
+u8 kvm_s390_pci_get_dtsm(struct zpci_dev *zdev)
+{
+ return (zdev->dtsm & KVM_S390_PCI_DTSM_MASK);
+}
+EXPORT_SYMBOL_GPL(kvm_s390_pci_get_dtsm);
+
int kvm_s390_pci_interp_probe(struct zpci_dev *zdev)
{
/* Must have appropriate hardware facilities */
@@ -449,6 +584,10 @@ int kvm_s390_pci_interp_disable(struct zpci_dev *zdev, bool force)
if (zdev->kzdev->fib.fmt0.aibv != 0)
kvm_s390_pci_aif_disable(zdev, force);

+ /* If we are using the IOAT assist, disable it now */
+ if (zdev->kzdev->ioat.head[0])
+ kvm_s390_pci_ioat_disable(zdev);
+
/* Remove the host CLP guest designation */
zdev->gisa = 0;

@@ -478,6 +617,8 @@ int kvm_s390_pci_dev_open(struct zpci_dev *zdev)
if (!kzdev)
return -ENOMEM;

+ mutex_init(&kzdev->ioat.ioat_lock);
+
kzdev->zdev = zdev;
zdev->kzdev = kzdev;

@@ -492,6 +633,7 @@ void kvm_s390_pci_dev_release(struct zpci_dev *zdev)
kzdev = zdev->kzdev;
WARN_ON(kzdev->zdev != zdev);
zdev->kzdev = 0;
+ mutex_destroy(&kzdev->ioat.ioat_lock);
kfree(kzdev);
}
EXPORT_SYMBOL_GPL(kvm_s390_pci_dev_release);
diff --git a/arch/s390/kvm/pci.h b/arch/s390/kvm/pci.h
index 4d3db58beb74..0fa0e3d61aaa 100644
--- a/arch/s390/kvm/pci.h
+++ b/arch/s390/kvm/pci.h
@@ -16,6 +16,8 @@
#include <asm/airq.h>
#include <asm/kvm_pci.h>

+#define KVM_S390_PCI_DTSM_MASK 0x40
+
struct zpci_gaite {
u32 gisa;
u8 gisc;
--
2.27.0


2022-02-07 20:02:54

by Matthew Rosato

Subject: [PATCH v3 04/30] s390/sclp: detect the AISI facility

Detect the Adapter Interruption Suppression Interpretation facility.

Reviewed-by: Eric Farman <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: Claudio Imbrenda <[email protected]>
Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/sclp.h | 1 +
drivers/s390/char/sclp_early.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index 8c2e142000d4..33b174007848 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -91,6 +91,7 @@ struct sclp_info {
unsigned char has_zpci_lsi : 1;
unsigned char has_aisii : 1;
unsigned char has_aeni : 1;
+ unsigned char has_aisi : 1;
unsigned int ibc;
unsigned int mtid;
unsigned int mtid_cp;
diff --git a/drivers/s390/char/sclp_early.c b/drivers/s390/char/sclp_early.c
index e9af01b4c97a..c13e55cc4a5d 100644
--- a/drivers/s390/char/sclp_early.c
+++ b/drivers/s390/char/sclp_early.c
@@ -47,6 +47,7 @@ static void __init sclp_early_facilities_detect(void)
sclp.has_kss = !!(sccb->fac98 & 0x01);
sclp.has_aisii = !!(sccb->fac118 & 0x40);
sclp.has_aeni = !!(sccb->fac118 & 0x20);
+ sclp.has_aisi = !!(sccb->fac118 & 0x10);
sclp.has_zpci_lsi = !!(sccb->fac118 & 0x01);
if (sccb->fac85 & 0x02)
S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
--
2.27.0


2022-02-07 21:43:17

by Cornelia Huck

Subject: Re: [PATCH v3 06/30] s390/airq: allow for airq structure that uses an input vector

On Fri, Feb 04 2022, Matthew Rosato <[email protected]> wrote:

> When doing device passthrough where interrupts are being forwarded from
> host to guest, we wish to use a pinned section of guest memory as the
> vector (the same memory used by the guest as the vector). To accomplish
> this, add a new parameter for airq_iv_create which allows passing an
> existing vector to be used instead of allocating a new one. The caller
> is responsible for ensuring the vector is pinned in memory as well as for
> unpinning the memory when the vector is no longer needed.
>
> A subsequent patch will use this new parameter for zPCI interpretation.
>
> Reviewed-by: Pierre Morel <[email protected]>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> arch/s390/include/asm/airq.h | 4 +++-
> arch/s390/pci/pci_irq.c | 8 ++++----
> drivers/s390/cio/airq.c | 10 +++++++---
> drivers/s390/virtio/virtio_ccw.c | 2 +-
> 4 files changed, 15 insertions(+), 9 deletions(-)

For virtio-ccw:

Acked-by: Cornelia Huck <[email protected]>


2022-02-08 14:39:00

by Pierre Morel

Subject: Re: [PATCH v3 12/30] s390/pci: get SHM information from list pci



On 2/4/22 22:15, Matthew Rosato wrote:
> KVM will need information on the special handle mask used to indicate
> emulated devices. In order to obtain this, a new type of list pci call
> must be made to gather the information. Extend clp_list_pci_req to
> also fetch the model-dependent-data field that holds this mask.
>
> Reviewed-by: Niklas Schnelle <[email protected]>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> arch/s390/include/asm/pci.h | 1 +
> arch/s390/include/asm/pci_clp.h | 2 +-
> arch/s390/pci/pci_clp.c | 25 ++++++++++++++++++++++---
> 3 files changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
> index 3c0b9986dcdc..e8a3fd5bc169 100644
> --- a/arch/s390/include/asm/pci.h
> +++ b/arch/s390/include/asm/pci.h
> @@ -227,6 +227,7 @@ int clp_enable_fh(struct zpci_dev *zdev, u32 *fh, u8 nr_dma_as);
> int clp_disable_fh(struct zpci_dev *zdev, u32 *fh);
> int clp_get_state(u32 fid, enum zpci_state *state);
> int clp_refresh_fh(u32 fid, u32 *fh);
> +int zpci_get_mdd(u32 *mdd);
>
> /* UID */
> void update_uid_checking(bool new);
> diff --git a/arch/s390/include/asm/pci_clp.h b/arch/s390/include/asm/pci_clp.h
> index d6189ed14f84..dc2041e97de4 100644
> --- a/arch/s390/include/asm/pci_clp.h
> +++ b/arch/s390/include/asm/pci_clp.h
> @@ -76,7 +76,7 @@ struct clp_req_list_pci {
> struct clp_rsp_list_pci {
> struct clp_rsp_hdr hdr;
> u64 resume_token;
> - u32 reserved2;
> + u32 mdd;
> u16 max_fn;
> u8 : 7;
> u8 uid_checking : 1;
> diff --git a/arch/s390/pci/pci_clp.c b/arch/s390/pci/pci_clp.c
> index dc733b58e74f..7477956be632 100644
> --- a/arch/s390/pci/pci_clp.c
> +++ b/arch/s390/pci/pci_clp.c
> @@ -328,7 +328,7 @@ int clp_disable_fh(struct zpci_dev *zdev, u32 *fh)
> }
>
> static int clp_list_pci_req(struct clp_req_rsp_list_pci *rrb,
> - u64 *resume_token, int *nentries)
> + u64 *resume_token, int *nentries, u32 *mdd)
> {
> int rc;
>
> @@ -354,6 +354,8 @@ static int clp_list_pci_req(struct clp_req_rsp_list_pci *rrb,
> *nentries = (rrb->response.hdr.len - LIST_PCI_HDR_LEN) /
> rrb->response.entry_size;
> *resume_token = rrb->response.resume_token;
> + if (mdd)
> + *mdd = rrb->response.mdd;

mdd is a central value for zPCI, like zpci_unique_uid; I think both
should be treated the same way, i.e. either belong to a central zPCI
structure or, as is done today with zpci_unique_uid, be a global.
An advantage of that design is that the clp_list_pci_req signature does
not need to be modified and we do not need an extra clp call.

However, I come late with this request, so I guess this simplification
can be done later.

Acked-by: Pierre Morel <[email protected]>


>
> return rc;
> }
> @@ -365,7 +367,7 @@ static int clp_list_pci(struct clp_req_rsp_list_pci *rrb, void *data,
> int nentries, i, rc;
>
> do {
> - rc = clp_list_pci_req(rrb, &resume_token, &nentries);
> + rc = clp_list_pci_req(rrb, &resume_token, &nentries, NULL);
> if (rc)
> return rc;
> for (i = 0; i < nentries; i++)
> @@ -383,7 +385,7 @@ static int clp_find_pci(struct clp_req_rsp_list_pci *rrb, u32 fid,
> int nentries, i, rc;
>
> do {
> - rc = clp_list_pci_req(rrb, &resume_token, &nentries);
> + rc = clp_list_pci_req(rrb, &resume_token, &nentries, NULL);
> if (rc)
> return rc;
> fh_list = rrb->response.fh_list;
> @@ -468,6 +470,23 @@ int clp_get_state(u32 fid, enum zpci_state *state)
> return rc;
> }
>
> +int zpci_get_mdd(u32 *mdd)
> +{
> + struct clp_req_rsp_list_pci *rrb;
> + u64 resume_token = 0;
> + int nentries, rc;
> +
> + rrb = clp_alloc_block(GFP_KERNEL);
> + if (!rrb)
> + return -ENOMEM;
> +
> + rc = clp_list_pci_req(rrb, &resume_token, &nentries, mdd);
> +
> + clp_free_block(rrb);
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(zpci_get_mdd);
> +
> static int clp_base_slpc(struct clp_req *req, struct clp_req_rsp_slpc *lpcb)
> {
> unsigned long limit = PAGE_SIZE - sizeof(lpcb->request);
>

--
Pierre Morel
IBM Lab Boeblingen
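
A minimal sketch of the alternative Pierre describes above -- treating mdd
like zpci_unique_uid -- with hypothetical names (zpci_mdd, zpci_stash_mdd);
the series as posted keeps the extra clp_list_pci_req parameter instead:

/* Sketch only: expose mdd as a global filled in once from the list-pci
 * response, mirroring zpci_unique_uid, so callers need neither an extra
 * parameter nor an extra clp call.
 */
u32 zpci_mdd;

static void zpci_stash_mdd(struct clp_rsp_list_pci *rsp)
{
	/* called from the existing list-pci response handling path */
	zpci_mdd = rsp->mdd;
}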

2022-02-08 18:52:21

by Pierre Morel

[permalink] [raw]
Subject: Re: [PATCH v3 29/30] KVM: s390: introduce CPU feature for zPCI Interpretation



On 2/4/22 22:15, Matthew Rosato wrote:
> KVM_S390_VM_CPU_FEAT_ZPCI_INTERP relays whether zPCI interpretive
> execution is possible based on the available hardware facilities.
>
> Signed-off-by: Matthew Rosato <[email protected]>

Reviewed-by: Pierre Morel <[email protected]>



> ---
> arch/s390/include/uapi/asm/kvm.h | 1 +
> arch/s390/kvm/kvm-s390.c | 5 +++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
> index 7a6b14874d65..ed06458a871f 100644
> --- a/arch/s390/include/uapi/asm/kvm.h
> +++ b/arch/s390/include/uapi/asm/kvm.h
> @@ -130,6 +130,7 @@ struct kvm_s390_vm_cpu_machine {
> #define KVM_S390_VM_CPU_FEAT_PFMFI 11
> #define KVM_S390_VM_CPU_FEAT_SIGPIF 12
> #define KVM_S390_VM_CPU_FEAT_KSS 13
> +#define KVM_S390_VM_CPU_FEAT_ZPCI_INTERP 14
> struct kvm_s390_vm_cpu_feat {
> __u64 feat[16];
> };
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 208b09d08385..e20d9ac1935d 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -434,6 +434,11 @@ static void kvm_s390_cpu_feat_init(void)
> if (test_facility(151)) /* DFLTCC */
> __insn32_query(INSN_DFLTCC, kvm_s390_available_subfunc.dfltcc);
>
> + /* zPCI Interpretation */
> + if (IS_ENABLED(CONFIG_VFIO_PCI_ZDEV) && test_facility(69) &&
> + test_facility(70) && test_facility(71) && test_facility(72))
> + allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ZPCI_INTERP);
> +
> if (MACHINE_HAS_ESOP)
> allow_cpu_feat(KVM_S390_VM_CPU_FEAT_ESOP);
> /*
>

--
Pierre Morel
IBM Lab Boeblingen
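
For reference, a rough sketch of how userspace might check for the new CPU
feature via the existing KVM_S390_VM_CPU_MODEL device attributes; this is
not from the series, and the MSB-first bit test (mirroring the kernel's
set_bit_inv convention for this bitmap) should be verified against the
uapi headers:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>	/* pulls in asm/kvm.h: kvm_s390_vm_cpu_feat, etc. */

static int host_has_zpci_interp(int vm_fd)
{
	struct kvm_s390_vm_cpu_feat feat;
	struct kvm_device_attr attr = {
		.group = KVM_S390_VM_CPU_MODEL,
		.attr  = KVM_S390_VM_CPU_MACHINE_FEAT,
		.addr  = (__u64)(unsigned long)&feat,
	};

	memset(&feat, 0, sizeof(feat));
	if (ioctl(vm_fd, KVM_GET_DEVICE_ATTR, &attr))
		return 0;

	/* Feature 14 lives in feat[0]; the kernel fills this bitmap
	 * MSB-first, hence the 63 - n shift (verify against headers). */
	return !!(feat.feat[0] &
		  (1ULL << (63 - KVM_S390_VM_CPU_FEAT_ZPCI_INTERP)));
}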

2022-02-09 04:08:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, Feb 08, 2022 at 12:26:24PM -0700, Alex Williamson wrote:

> > Personally, I think it is wrong layering for VFIO to be aware of KVM
> > like this. This marks the first time that VFIO core code itself is
> > being made aware of the KVM linkage.
>
> I agree, but I've resigned that I've lost that battle. Both mdev vGPU
> vendors make specific assumptions about running on a VM.

The vGPUs are not as egregious though, are they?

> > Or, at the very least, everything needs to be described in some way
> > that makes it clear what is happening to userspace, without kvm,
> > through these ioctls.
>
> As I understand the discussion here:
>
> https://lore.kernel.org/all/[email protected]/
>
> The assumption is that there is no non-KVM userspace currently. This
> seems like a regression to me.

Indeed, I definitely don't like it either. This is not VFIO if it is
just driving KVM.

I would prefer they add a function to get the 'struct device *' from a
VFIO device fd and drive more of this from kvm, as appropriate.

> > > this is meant to extend vfio-pci proper for the whole arch. Is there a
> > > compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
> > > code that implements these arch specific hooks and the core for
> > > everything else? SPAPR code could probably converted similarly, it
> > > exists here for legacy reasons. [Cc Jason]
> >
> > I'm not sure I get what you are suggesting? Where would these ifdefs
> > be?
>
> Essentially just:
>
> static const struct vfio_device_ops vfio_pci_ops = {
> .name = "vfio-pci",
> #ifdef CONFIG_S390
> .open_device = vfio_zpci_open_device,
> .close_device = vfio_zpci_close_device,
> .ioctl = vfio_zpci_ioctl,
> #else
> .open_device = vfio_pci_open_device,
> .close_device = vfio_pci_core_close_device,
> .ioctl = vfio_pci_core_ioctl,
> #endif
> .read = vfio_pci_core_read,
> .write = vfio_pci_core_write,
> .mmap = vfio_pci_core_mmap,
> .request = vfio_pci_core_request,
> .match = vfio_pci_core_match,
> };
>
> It would at least provide more validation/exercise of the core/vendor
> split. Thanks,

This would have to be in every pci driver - this is not just code the
universal vfio-pci has to enable, but every migration driver/etc too.

And we will need it again in vfio-cxl for s390 in 10 years too.

So, I think this approach is the right one, aside from the
philosophical question of so tightly linking s390 vfio to KVM.

Jason

2022-02-09 06:21:51

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v3 06/30] s390/airq: allow for airq structure that uses an input vector

On Fri, 4 Feb 2022 16:15:12 -0500
Matthew Rosato <[email protected]> wrote:

> When doing device passthrough where interrupts are being forwarded from
> host to guest, we wish to use a pinned section of guest memory as the
> vector (the same memory used by the guest as the vector). To accomplish
> this, add a new parameter for airq_iv_create which allows passing an
> existing vector to be used instead of allocating a new one. The caller
> is responsible for ensuring the vector is pinned in memory as well as for
> unpinning the memory when the vector is no longer needed.
>
> A subsequent patch will use this new parameter for zPCI interpretation.
>
> Reviewed-by: Pierre Morel <[email protected]>
> Signed-off-by: Matthew Rosato <[email protected]>

Reviewed-by: Claudio Imbrenda <[email protected]>

> ---
> arch/s390/include/asm/airq.h | 4 +++-
> arch/s390/pci/pci_irq.c | 8 ++++----
> drivers/s390/cio/airq.c | 10 +++++++---
> drivers/s390/virtio/virtio_ccw.c | 2 +-
> 4 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/arch/s390/include/asm/airq.h b/arch/s390/include/asm/airq.h
> index 7918a7d09028..e82e5626e139 100644
> --- a/arch/s390/include/asm/airq.h
> +++ b/arch/s390/include/asm/airq.h
> @@ -47,8 +47,10 @@ struct airq_iv {
> #define AIRQ_IV_PTR 4 /* Allocate the ptr array */
> #define AIRQ_IV_DATA 8 /* Allocate the data array */
> #define AIRQ_IV_CACHELINE 16 /* Cacheline alignment for the vector */
> +#define AIRQ_IV_GUESTVEC 32 /* Vector is a pinned guest page */
>
> -struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags);
> +struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags,
> + unsigned long *vec);
> void airq_iv_release(struct airq_iv *iv);
> unsigned long airq_iv_alloc(struct airq_iv *iv, unsigned long num);
> void airq_iv_free(struct airq_iv *iv, unsigned long bit, unsigned long num);
> diff --git a/arch/s390/pci/pci_irq.c b/arch/s390/pci/pci_irq.c
> index cc4c8d7c8f5c..0d0a02a9fbbf 100644
> --- a/arch/s390/pci/pci_irq.c
> +++ b/arch/s390/pci/pci_irq.c
> @@ -296,7 +296,7 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> zdev->aisb = bit;
>
> /* Create adapter interrupt vector */
> - zdev->aibv = airq_iv_create(msi_vecs, AIRQ_IV_DATA | AIRQ_IV_BITLOCK);
> + zdev->aibv = airq_iv_create(msi_vecs, AIRQ_IV_DATA | AIRQ_IV_BITLOCK, NULL);
> if (!zdev->aibv)
> return -ENOMEM;
>
> @@ -419,7 +419,7 @@ static int __init zpci_directed_irq_init(void)
> union zpci_sic_iib iib = {{0}};
> unsigned int cpu;
>
> - zpci_sbv = airq_iv_create(num_possible_cpus(), 0);
> + zpci_sbv = airq_iv_create(num_possible_cpus(), 0, NULL);
> if (!zpci_sbv)
> return -ENOMEM;
>
> @@ -441,7 +441,7 @@ static int __init zpci_directed_irq_init(void)
> zpci_ibv[cpu] = airq_iv_create(cache_line_size() * BITS_PER_BYTE,
> AIRQ_IV_DATA |
> AIRQ_IV_CACHELINE |
> - (!cpu ? AIRQ_IV_ALLOC : 0));
> + (!cpu ? AIRQ_IV_ALLOC : 0), NULL);
> if (!zpci_ibv[cpu])
> return -ENOMEM;
> }
> @@ -458,7 +458,7 @@ static int __init zpci_floating_irq_init(void)
> if (!zpci_ibv)
> return -ENOMEM;
>
> - zpci_sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC);
> + zpci_sbv = airq_iv_create(ZPCI_NR_DEVICES, AIRQ_IV_ALLOC, NULL);
> if (!zpci_sbv)
> goto out_free;
>
> diff --git a/drivers/s390/cio/airq.c b/drivers/s390/cio/airq.c
> index 2f2226786319..375a58b1c838 100644
> --- a/drivers/s390/cio/airq.c
> +++ b/drivers/s390/cio/airq.c
> @@ -122,10 +122,12 @@ static inline unsigned long iv_size(unsigned long bits)
> * airq_iv_create - create an interrupt vector
> * @bits: number of bits in the interrupt vector
> * @flags: allocation flags
> + * @vec: pointer to pinned guest memory if AIRQ_IV_GUESTVEC
> *
> * Returns a pointer to an interrupt vector structure
> */
> -struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
> +struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags,
> + unsigned long *vec)
> {
> struct airq_iv *iv;
> unsigned long size;
> @@ -146,6 +148,8 @@ struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
> &iv->vector_dma);
> if (!iv->vector)
> goto out_free;
> + } else if (flags & AIRQ_IV_GUESTVEC) {
> + iv->vector = vec;
> } else {
> iv->vector = cio_dma_zalloc(size);
> if (!iv->vector)
> @@ -185,7 +189,7 @@ struct airq_iv *airq_iv_create(unsigned long bits, unsigned long flags)
> kfree(iv->avail);
> if (iv->flags & AIRQ_IV_CACHELINE && iv->vector)
> dma_pool_free(airq_iv_cache, iv->vector, iv->vector_dma);
> - else
> + else if (!(iv->flags & AIRQ_IV_GUESTVEC))
> cio_dma_free(iv->vector, size);
> kfree(iv);
> out:
> @@ -204,7 +208,7 @@ void airq_iv_release(struct airq_iv *iv)
> kfree(iv->bitlock);
> if (iv->flags & AIRQ_IV_CACHELINE)
> dma_pool_free(airq_iv_cache, iv->vector, iv->vector_dma);
> - else
> + else if (!(iv->flags & AIRQ_IV_GUESTVEC))
> cio_dma_free(iv->vector, iv_size(iv->bits));
> kfree(iv->avail);
> kfree(iv);
> diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
> index 52c376d15978..410498d693f8 100644
> --- a/drivers/s390/virtio/virtio_ccw.c
> +++ b/drivers/s390/virtio/virtio_ccw.c
> @@ -241,7 +241,7 @@ static struct airq_info *new_airq_info(int index)
> return NULL;
> rwlock_init(&info->lock);
> info->aiv = airq_iv_create(VIRTIO_IV_BITS, AIRQ_IV_ALLOC | AIRQ_IV_PTR
> - | AIRQ_IV_CACHELINE);
> + | AIRQ_IV_CACHELINE, NULL);
> if (!info->aiv) {
> kfree(info);
> return NULL;


2022-02-09 06:41:28

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Fri, 4 Feb 2022 16:15:30 -0500
Matthew Rosato <[email protected]> wrote:

> KVM zPCI passthrough device logic will need a reference to the associated
> kvm guest that has access to the device. Let's register a group notifier
> for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
> an association between a kvm guest and the host zdev.
>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> arch/s390/include/asm/kvm_pci.h | 2 ++
> drivers/vfio/pci/vfio_pci_core.c | 2 ++
> drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
> include/linux/vfio_pci_core.h | 10 +++++++
> 4 files changed, 60 insertions(+)
>
> diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
> index e4696f5592e1..16290b4cf2a6 100644
> --- a/arch/s390/include/asm/kvm_pci.h
> +++ b/arch/s390/include/asm/kvm_pci.h
> @@ -16,6 +16,7 @@
> #include <linux/kvm.h>
> #include <linux/pci.h>
> #include <linux/mutex.h>
> +#include <linux/notifier.h>
> #include <asm/pci_insn.h>
> #include <asm/pci_dma.h>
>
> @@ -32,6 +33,7 @@ struct kvm_zdev {
> u64 rpcit_count;
> struct kvm_zdev_ioat ioat;
> struct zpci_fib fib;
> + struct notifier_block nb;
> };
>
> int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f948e6cd2993..fc57d4d0abbe 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>
> vfio_pci_vf_token_user_add(vdev, -1);
> vfio_spapr_pci_eeh_release(vdev->pdev);
> + vfio_pci_zdev_release(vdev);
> vfio_pci_core_disable(vdev);
>
> mutex_lock(&vdev->igate);
> @@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
> {
> vfio_pci_probe_mmaps(vdev);
> + vfio_pci_zdev_open(vdev);
> vfio_spapr_pci_eeh_open(vdev->pdev);
> vfio_pci_vf_token_user_add(vdev, 1);
> }

If this handling were for a specific device, I think we'd be suggesting
this is the point at which we cross over to a vendor variant making use
of vfio-pci-core rather than hooking directly into the core code. But
this is meant to extend vfio-pci proper for the whole arch. Is there a
compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
code that implements these arch specific hooks and the core for
everything else? SPAPR code could probably be converted similarly; it
exists here for legacy reasons. [Cc Jason]

Also, please note the DEVICE_FEATURE generalizations in the latest
series from NVIDIA for mlx5 migration support:

https://lore.kernel.org/all/[email protected]/

If this series were to go in via the s390 tree, I'd request a branch so
that we can continue to work on this in vfio code as well. Thanks,

Alex

> diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
> index ea4c0d2b0663..9f8284499111 100644
> --- a/drivers/vfio/pci/vfio_pci_zdev.c
> +++ b/drivers/vfio/pci/vfio_pci_zdev.c
> @@ -13,6 +13,7 @@
> #include <linux/vfio_zdev.h>
> #include <asm/pci_clp.h>
> #include <asm/pci_io.h>
> +#include <asm/kvm_pci.h>
>
> #include <linux/vfio_pci_core.h>
>
> @@ -136,3 +137,48 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
>
> return ret;
> }
> +
> +static int vfio_pci_zdev_group_notifier(struct notifier_block *nb,
> + unsigned long action, void *data)
> +{
> + struct kvm_zdev *kzdev = container_of(nb, struct kvm_zdev, nb);
> +
> + if (action == VFIO_GROUP_NOTIFY_SET_KVM) {
> + if (!data || !kzdev->zdev)
> + return NOTIFY_DONE;
> + kzdev->kvm = data;
> + }
> +
> + return NOTIFY_OK;
> +}
> +
> +void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
> +{
> + unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
> + struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +
> + if (!zdev)
> + return;
> +
> + if (kvm_s390_pci_dev_open(zdev))
> + return;
> +
> + zdev->kzdev->nb.notifier_call = vfio_pci_zdev_group_notifier;
> +
> + if (vfio_register_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
> + &events, &zdev->kzdev->nb))
> + kvm_s390_pci_dev_release(zdev);
> +}
> +
> +void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
> +{
> + struct zpci_dev *zdev = to_zpci(vdev->pdev);
> +
> + if (!zdev || !zdev->kzdev)
> + return;
> +
> + vfio_unregister_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
> + &zdev->kzdev->nb);
> +
> + kvm_s390_pci_dev_release(zdev);
> +}
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 5e2bca3b89db..05287f8ac855 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -198,12 +198,22 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
> #ifdef CONFIG_VFIO_PCI_ZDEV
> extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
> struct vfio_info_cap *caps);
> +void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev);
> +void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev);
> #else
> static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
> struct vfio_info_cap *caps)
> {
> return -ENODEV;
> }
> +
> +static inline void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
> +{
> +}
> +
> +static inline void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
> +{
> +}
> #endif
>
> /* Will be exported for vfio pci drivers usage */


2022-02-09 07:11:22

by Matthew Rosato

[permalink] [raw]
Subject: [PATCH v3 25/30] vfio-pci/zdev: wire up zPCI interpretive execution support

Introduce support for VFIO_DEVICE_FEATURE_ZPCI_INTERP, a new device
feature for the VFIO_DEVICE_FEATURE ioctl. This interface is used to
request that an s390x vfio-pci device enable/disable zPCI interpretive
execution, which allows zPCI instructions to be executed directly by
the underlying firmware without KVM involvement.

Signed-off-by: Matthew Rosato <[email protected]>
---
arch/s390/include/asm/kvm_pci.h | 1 +
drivers/vfio/pci/vfio_pci_core.c | 2 +
drivers/vfio/pci/vfio_pci_zdev.c | 73 ++++++++++++++++++++++++++++++++
include/linux/vfio_pci_core.h | 10 +++++
include/uapi/linux/vfio.h | 7 +++
include/uapi/linux/vfio_zdev.h | 15 +++++++
6 files changed, 108 insertions(+)

diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
index 16290b4cf2a6..8f7d371e3e59 100644
--- a/arch/s390/include/asm/kvm_pci.h
+++ b/arch/s390/include/asm/kvm_pci.h
@@ -34,6 +34,7 @@ struct kvm_zdev {
struct kvm_zdev_ioat ioat;
struct zpci_fib fib;
struct notifier_block nb;
+ bool interpretation;
};

int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index fc57d4d0abbe..2b2d64a2190c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1172,6 +1172,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
mutex_unlock(&vdev->vf_token->lock);

return 0;
+ case VFIO_DEVICE_FEATURE_ZPCI_INTERP:
+ return vfio_pci_zdev_feat_interp(vdev, feature, arg);
default:
return -ENOTTY;
}
diff --git a/drivers/vfio/pci/vfio_pci_zdev.c b/drivers/vfio/pci/vfio_pci_zdev.c
index 9f8284499111..71cc1eb8085f 100644
--- a/drivers/vfio/pci/vfio_pci_zdev.c
+++ b/drivers/vfio/pci/vfio_pci_zdev.c
@@ -54,6 +54,10 @@ static int zpci_group_cap(struct zpci_dev *zdev, struct vfio_info_cap *caps)
.version = zdev->version
};

+ /* Some values are different for interpreted devices */
+ if (zdev->kzdev && zdev->kzdev->interpretation)
+ cap.maxstbl = zdev->maxstbl;
+
return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
}

@@ -138,6 +142,67 @@ int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
return ret;
}

+int vfio_pci_zdev_feat_interp(struct vfio_pci_core_device *vdev,
+ struct vfio_device_feature feature,
+ unsigned long arg)
+{
+ struct zpci_dev *zdev = to_zpci(vdev->pdev);
+ struct vfio_device_zpci_interp *data;
+ struct vfio_device_feature *feat;
+ unsigned long minsz;
+ int size, rc;
+
+ if (!zdev || !zdev->kzdev)
+ return -EINVAL;
+
+ /* If PROBE specified, return probe results immediately */
+ if (feature.flags & VFIO_DEVICE_FEATURE_PROBE)
+ return kvm_s390_pci_interp_probe(zdev);
+
+ size = sizeof(*feat) + sizeof(*data);
+ feat = kzalloc(size, GFP_KERNEL);
+ if (!feat)
+ return -ENOMEM;
+
+ data = (struct vfio_device_zpci_interp *)&feat->data;
+ minsz = offsetofend(struct vfio_device_feature, flags);
+
+ if (feature.argsz < minsz + sizeof(*data))
+ return -EINVAL;
+
+ /* Get the rest of the payload for GET/SET */
+ rc = copy_from_user(data, (void __user *)(arg + minsz),
+ sizeof(*data));
+ if (rc)
+ rc = -EINVAL;
+
+ if (feature.flags & VFIO_DEVICE_FEATURE_GET) {
+ if (zdev->gisa != 0)
+ data->flags = VFIO_DEVICE_ZPCI_FLAG_INTERP;
+ else
+ data->flags = 0;
+ data->fh = zdev->fh;
+ /* userspace is using host fh, give interpreted clp values */
+ zdev->kzdev->interpretation = true;
+
+ if (copy_to_user((void __user *)arg, feat, size))
+ rc = -EFAULT;
+ } else if (feature.flags & VFIO_DEVICE_FEATURE_SET) {
+ if (data->flags == VFIO_DEVICE_ZPCI_FLAG_INTERP)
+ rc = kvm_s390_pci_interp_enable(zdev);
+ else if (data->flags == 0)
+ rc = kvm_s390_pci_interp_disable(zdev, false);
+ else
+ rc = -EINVAL;
+ } else {
+ /* Neither GET nor SET were specified */
+ rc = -EINVAL;
+ }
+
+ kfree(feat);
+ return rc;
+}
+
static int vfio_pci_zdev_group_notifier(struct notifier_block *nb,
unsigned long action, void *data)
{
@@ -164,6 +229,7 @@ void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
return;

zdev->kzdev->nb.notifier_call = vfio_pci_zdev_group_notifier;
+ zdev->kzdev->interpretation = false;

if (vfio_register_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
&events, &zdev->kzdev->nb))
@@ -180,5 +246,12 @@ void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev)
vfio_unregister_notifier(vdev->vdev.dev, VFIO_GROUP_NOTIFY,
&zdev->kzdev->nb);

+ /*
+ * If the device was using interpretation, don't trust that userspace
+ * did the appropriate cleanup
+ */
+ if (zdev->gisa != 0)
+ kvm_s390_pci_interp_disable(zdev, true);
+
kvm_s390_pci_dev_release(zdev);
}
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 05287f8ac855..0db2b1051931 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -198,6 +198,9 @@ static inline int vfio_pci_igd_init(struct vfio_pci_core_device *vdev)
#ifdef CONFIG_VFIO_PCI_ZDEV
extern int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
struct vfio_info_cap *caps);
+int vfio_pci_zdev_feat_interp(struct vfio_pci_core_device *vdev,
+ struct vfio_device_feature feature,
+ unsigned long arg);
void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev);
void vfio_pci_zdev_release(struct vfio_pci_core_device *vdev);
#else
@@ -207,6 +210,13 @@ static inline int vfio_pci_info_zdev_add_caps(struct vfio_pci_core_device *vdev,
return -ENODEV;
}

+static inline int vfio_pci_zdev_feat_interp(struct vfio_pci_core_device *vdev,
+ struct vfio_device_feature feature,
+ unsigned long arg)
+{
+ return -ENOTTY;
+}
+
static inline void vfio_pci_zdev_open(struct vfio_pci_core_device *vdev)
{
}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..b9a75485b8e7 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1002,6 +1002,13 @@ struct vfio_device_feature {
*/
#define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN (0)

+/*
+ * Provide support for enabling interpretation of zPCI instructions. This
+ * feature is only valid for s390x PCI devices. Data provided when setting
+ * and getting this feature is further described in vfio_zdev.h
+ */
+#define VFIO_DEVICE_FEATURE_ZPCI_INTERP (1)
+
/* -------- API for Type1 VFIO IOMMU -------- */

/**
diff --git a/include/uapi/linux/vfio_zdev.h b/include/uapi/linux/vfio_zdev.h
index b4309397b6b2..575f0410dc66 100644
--- a/include/uapi/linux/vfio_zdev.h
+++ b/include/uapi/linux/vfio_zdev.h
@@ -75,4 +75,19 @@ struct vfio_device_info_cap_zpci_pfip {
__u8 pfip[];
};

+/**
+ * VFIO_DEVICE_FEATURE_ZPCI_INTERP
+ *
+ * This feature is used for enabling zPCI instruction interpretation for a
+ * device. No data is provided when setting this feature. When getting
+ * this feature, the following structure is provided which details whether
+ * or not interpretation is active and provides the guest with host device
+ * information necessary to enable interpretation.
+ */
+struct vfio_device_zpci_interp {
+ __u64 flags;
+#define VFIO_DEVICE_ZPCI_FLAG_INTERP 1
+ __u32 fh; /* Host device function handle */
+};
+
#endif
--
2.27.0
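
As a rough illustration (not part of the patch), a QEMU-like userspace
might enable interpretation on an opened vfio-pci device fd along these
lines; error handling is trimmed and the layout simply follows the uapi
structures added above:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <linux/vfio_zdev.h>

static int zpci_interp_enable(int device_fd)
{
	/* __u64 buffer keeps the trailing payload 8-byte aligned */
	__u64 buf[(sizeof(struct vfio_device_feature) +
		   sizeof(struct vfio_device_zpci_interp) + 7) / 8];
	struct vfio_device_feature *feat = (struct vfio_device_feature *)buf;
	struct vfio_device_zpci_interp *data =
		(struct vfio_device_zpci_interp *)feat->data;

	memset(buf, 0, sizeof(buf));
	feat->argsz = sizeof(struct vfio_device_feature) +
		      sizeof(struct vfio_device_zpci_interp);
	feat->flags = VFIO_DEVICE_FEATURE_SET |
		      VFIO_DEVICE_FEATURE_ZPCI_INTERP;
	data->flags = VFIO_DEVICE_ZPCI_FLAG_INTERP;	/* enable interpretation */

	return ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
}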


2022-02-09 09:55:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, Feb 08, 2022 at 10:43:19AM -0700, Alex Williamson wrote:
> On Fri, 4 Feb 2022 16:15:30 -0500
> Matthew Rosato <[email protected]> wrote:
>
> > KVM zPCI passthrough device logic will need a reference to the associated
> > kvm guest that has access to the device. Let's register a group notifier
> > for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
> > an association between a kvm guest and the host zdev.
> >
> > Signed-off-by: Matthew Rosato <[email protected]>
> > arch/s390/include/asm/kvm_pci.h | 2 ++
> > drivers/vfio/pci/vfio_pci_core.c | 2 ++
> > drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
> > include/linux/vfio_pci_core.h | 10 +++++++
> > 4 files changed, 60 insertions(+)
> >
> > diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
> > index e4696f5592e1..16290b4cf2a6 100644
> > +++ b/arch/s390/include/asm/kvm_pci.h
> > @@ -16,6 +16,7 @@
> > #include <linux/kvm.h>
> > #include <linux/pci.h>
> > #include <linux/mutex.h>
> > +#include <linux/notifier.h>
> > #include <asm/pci_insn.h>
> > #include <asm/pci_dma.h>
> >
> > @@ -32,6 +33,7 @@ struct kvm_zdev {
> > u64 rpcit_count;
> > struct kvm_zdev_ioat ioat;
> > struct zpci_fib fib;
> > + struct notifier_block nb;
> > };
> >
> > int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index f948e6cd2993..fc57d4d0abbe 100644
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
> >
> > vfio_pci_vf_token_user_add(vdev, -1);
> > vfio_spapr_pci_eeh_release(vdev->pdev);
> > + vfio_pci_zdev_release(vdev);
> > vfio_pci_core_disable(vdev);
> >
> > mutex_lock(&vdev->igate);
> > @@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> > void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
> > {
> > vfio_pci_probe_mmaps(vdev);
> > + vfio_pci_zdev_open(vdev);
> > vfio_spapr_pci_eeh_open(vdev->pdev);
> > vfio_pci_vf_token_user_add(vdev, 1);
> > }
>
> If this handling were for a specific device, I think we'd be suggesting
> this is the point at which we cross over to a vendor variant making use
> of vfio-pci-core rather than hooking directly into the core code.

Personally, I think it is wrong layering for VFIO to be aware of KVM
like this. This marks the first time that VFIO core code itself is
being made aware of the KVM linkage.

It copies the same kind of design the s390 specific mdev use of
putting VFIO in charge of KVM functionality. If we are doing this we
should just give up and admit that KVM is a first-class part of struct
vfio_device and get rid of the notifier stuff too, at least for s390.

Reading the patches and descriptions pretty much everything is boiling
down to 'use vfio to tell the kvm architecture code to do something' -
which I think needs to be handled through a KVM side ioctl.

Or, at the very least, everything needs to be described in some way
that makes it clear what is happening to userspace, without kvm,
through these ioctls.

This seems especially true now that it seems s390 PCI support is
almost truly functional, with actual new userspace instructions to
issue MMIO operations that work outside of KVM.

I'm not sure how this all fits together, but I would expect an outcome
where DPDK could run on these new systems and not have to know
anything more about s390 beyond using the proper MMIO instructions via
some compilation time enablement.

(I've been reviewing s390 patches updating rdma for a parallel set of
stuff)

> this is meant to extend vfio-pci proper for the whole arch. Is there a
> compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
> code that implements these arch specific hooks and the core for
> everything else? SPAPR code could probably converted similarly, it
> exists here for legacy reasons. [Cc Jason]

I'm not sure I get what you are suggesting? Where would these ifdefs
be?

> Also, please note the DEVICE_FEATURE generalizations in the latest
> series from NVIDIA for mlx5 migration support:

> https://lore.kernel.org/all/[email protected]/

Yes, please don't implement a bunch of new FEATURE code without taking
the cleanup patches for feature support from that series too.

I can put them on a branch for you if needed.

Jason

2022-02-09 10:11:13

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On 2/8/22 2:26 PM, Alex Williamson wrote:
> On Tue, 8 Feb 2022 14:51:41 -0400
> Jason Gunthorpe <[email protected]> wrote:
>
>> On Tue, Feb 08, 2022 at 10:43:19AM -0700, Alex Williamson wrote:
>>> On Fri, 4 Feb 2022 16:15:30 -0500
>>> Matthew Rosato <[email protected]> wrote:
>>>
>>>> KVM zPCI passthrough device logic will need a reference to the associated
>>>> kvm guest that has access to the device. Let's register a group notifier
>>>> for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
>>>> an association between a kvm guest and the host zdev.
>>>>
>>>> Signed-off-by: Matthew Rosato <[email protected]>
>>>> arch/s390/include/asm/kvm_pci.h | 2 ++
>>>> drivers/vfio/pci/vfio_pci_core.c | 2 ++
>>>> drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
>>>> include/linux/vfio_pci_core.h | 10 +++++++
>>>> 4 files changed, 60 insertions(+)
>>>>
>>>> diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
>>>> index e4696f5592e1..16290b4cf2a6 100644
>>>> +++ b/arch/s390/include/asm/kvm_pci.h
>>>> @@ -16,6 +16,7 @@
>>>> #include <linux/kvm.h>
>>>> #include <linux/pci.h>
>>>> #include <linux/mutex.h>
>>>> +#include <linux/notifier.h>
>>>> #include <asm/pci_insn.h>
>>>> #include <asm/pci_dma.h>
>>>>
>>>> @@ -32,6 +33,7 @@ struct kvm_zdev {
>>>> u64 rpcit_count;
>>>> struct kvm_zdev_ioat ioat;
>>>> struct zpci_fib fib;
>>>> + struct notifier_block nb;
>>>> };
>>>>
>>>> int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index f948e6cd2993..fc57d4d0abbe 100644
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>>>>
>>>> vfio_pci_vf_token_user_add(vdev, -1);
>>>> vfio_spapr_pci_eeh_release(vdev->pdev);
>>>> + vfio_pci_zdev_release(vdev);
>>>> vfio_pci_core_disable(vdev);
>>>>
>>>> mutex_lock(&vdev->igate);
>>>> @@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
>>>> void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
>>>> {
>>>> vfio_pci_probe_mmaps(vdev);
>>>> + vfio_pci_zdev_open(vdev);
>>>> vfio_spapr_pci_eeh_open(vdev->pdev);
>>>> vfio_pci_vf_token_user_add(vdev, 1);
>>>> }
>>>
>>> If this handling were for a specific device, I think we'd be suggesting
>>> this is the point at which we cross over to a vendor variant making use
>>> of vfio-pci-core rather than hooking directly into the core code.
>>
>> Personally, I think it is wrong layering for VFIO to be aware of KVM
>> like this. This marks the first time that VFIO core code itself is
>> being made aware of the KVM linkage.
>
> I agree, but I've resigned that I've lost that battle. Both mdev vGPU
> vendors make specific assumptions about running on a VM. VFIO was
> never intended to be tied to KVM or the specific use case of a VM.
>
>> It copies the same kind of design the s390 specific mdev use of
>> putting VFIO in charge of KVM functionality. If we are doing this we
>> should just give up and admit that KVM is a first-class part of struct
>> vfio_device and get rid of the notifier stuff too, at least for s390.
>
> Euw. You're right, I really don't like vfio core code embracing this
> dependency for s390, device specific use cases are bad enough.
>
>> Reading the patches and descriptions pretty much everything is boiling
>> down to 'use vfio to tell the kvm architecture code to do something' -
>> which I think needs to be handled through a KVM side ioctl.
>
> AIF at least sounds a lot like the reason we invented the irq bypass
> mechanism to allow interrupt producers and consumers to register
> independently and associate to each other with a shared token.

Yes, these do sound quite similar; looking at it now, though, I haven't
yet fully grokked irq bypass... But with AIF you have the case where
either the interrupt is delivered directly to the guest by firmware
via an s390 construct (the gisa), or, under various circumstances, the
host (kvm) is prodded to perform the delivery (still via the gisa) instead.

>
> Is the purpose of IOAT to associate the device to a set of KVM page
> tables? That seems like a container or future iommufd operation. I

Yes, here we are establishing a relationship with the DMA table in the
guest so that once mappings are established guest PCI operations
(handled via special instructions in s390) don't need to go through the
host but can be directly handled by firmware (so, effectively guest can
keep running on its vcpu vs breaking out).

> read DTSM as supported formats for the IOAT.
>
>> Or, at the very least, everything needs to be described in some way
>> that makes it clear what is happening to userspace, without kvm,
>> through these ioctls.

Nothing, they don't need these ioctls. Userspace without a KVM
registration for the device in question gets -EINVAL.

>
> As I understand the discussion here:
>
> https://lore.kernel.org/all/[email protected]/
>
> The assumption is that there is no non-KVM userspace currently. This
> seems like a regression to me.

It's more that non-KVM userspace doesn't care about what these ioctls
are doing... The enabling of 'interp, aif, ioat' is only pertinent when
there is a KVM userspace, specifically because the information being
shared / actions being performed as a result are only relevant to
properly enabling zPCI features when the zPCI device is being passed
through to a VM guest. If you're just using a userspace driver to talk
to the device (no KVM guest involved) then the kernel zPCI layer already
has this device set up using whatever s390 facilities are available.

>
>> This seems especially true now that it seems s390 PCI support is
>> almost truely functional, with actual new userspace instructions to
>> issue MMIO operations that work outside of KVM.
>>
>> I'm not sure how this all fits together, but I would expect an outcome
>> where DPDK could run on these new systems and not have to know
>> anything more about s390 beyond using the proper MMIO instructions via
>> some compilation time enablement.
>
> Yes, fully enabling zPCI with vfio, but only for KVM is not optimal.

See above. I think there is a misunderstanding here: it's not that we
are only enabling zPCI with vfio for KVM, but rather that when using
vfio to pass the device to a guest there is additional work that has to
happen in order to 'fully enable' zPCI.

>
>> (I've been reviewing s390 patches updating rdma for a parallel set of
>> stuff)
>>
>>> this is meant to extend vfio-pci proper for the whole arch. Is there a
>>> compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
>>> code that implements these arch specific hooks and the core for
>>> everything else? SPAPR code could probably converted similarly, it
>>> exists here for legacy reasons. [Cc Jason]
>>
>> I'm not sure I get what you are suggesting? Where would these ifdefs
>> be?
>
> Essentially just:
>
> static const struct vfio_device_ops vfio_pci_ops = {
> .name = "vfio-pci",
> #ifdef CONFIG_S390
> .open_device = vfio_zpci_open_device,
> .close_device = vfio_zpci_close_device,
> .ioctl = vfio_zpci_ioctl,
> #else
> .open_device = vfio_pci_open_device,
> .close_device = vfio_pci_core_close_device,
> .ioctl = vfio_pci_core_ioctl,
> #endif
> .read = vfio_pci_core_read,
> .write = vfio_pci_core_write,
> .mmap = vfio_pci_core_mmap,
> .request = vfio_pci_core_request,
> .match = vfio_pci_core_match,
> };
>
> It would at least provide more validation/exercise of the core/vendor
> split. Thanks,
>
> Alex
>


2022-02-09 12:22:40

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, 8 Feb 2022 14:51:41 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Tue, Feb 08, 2022 at 10:43:19AM -0700, Alex Williamson wrote:
> > On Fri, 4 Feb 2022 16:15:30 -0500
> > Matthew Rosato <[email protected]> wrote:
> >
> > > KVM zPCI passthrough device logic will need a reference to the associated
> > > kvm guest that has access to the device. Let's register a group notifier
> > > for VFIO_GROUP_NOTIFY_SET_KVM to catch this information in order to create
> > > an association between a kvm guest and the host zdev.
> > >
> > > Signed-off-by: Matthew Rosato <[email protected]>
> > > arch/s390/include/asm/kvm_pci.h | 2 ++
> > > drivers/vfio/pci/vfio_pci_core.c | 2 ++
> > > drivers/vfio/pci/vfio_pci_zdev.c | 46 ++++++++++++++++++++++++++++++++
> > > include/linux/vfio_pci_core.h | 10 +++++++
> > > 4 files changed, 60 insertions(+)
> > >
> > > diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
> > > index e4696f5592e1..16290b4cf2a6 100644
> > > +++ b/arch/s390/include/asm/kvm_pci.h
> > > @@ -16,6 +16,7 @@
> > > #include <linux/kvm.h>
> > > #include <linux/pci.h>
> > > #include <linux/mutex.h>
> > > +#include <linux/notifier.h>
> > > #include <asm/pci_insn.h>
> > > #include <asm/pci_dma.h>
> > >
> > > @@ -32,6 +33,7 @@ struct kvm_zdev {
> > > u64 rpcit_count;
> > > struct kvm_zdev_ioat ioat;
> > > struct zpci_fib fib;
> > > + struct notifier_block nb;
> > > };
> > >
> > > int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
> > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > > index f948e6cd2993..fc57d4d0abbe 100644
> > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > @@ -452,6 +452,7 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
> > >
> > > vfio_pci_vf_token_user_add(vdev, -1);
> > > vfio_spapr_pci_eeh_release(vdev->pdev);
> > > + vfio_pci_zdev_release(vdev);
> > > vfio_pci_core_disable(vdev);
> > >
> > > mutex_lock(&vdev->igate);
> > > @@ -470,6 +471,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> > > void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev)
> > > {
> > > vfio_pci_probe_mmaps(vdev);
> > > + vfio_pci_zdev_open(vdev);
> > > vfio_spapr_pci_eeh_open(vdev->pdev);
> > > vfio_pci_vf_token_user_add(vdev, 1);
> > > }
> >
> > If this handling were for a specific device, I think we'd be suggesting
> > this is the point at which we cross over to a vendor variant making use
> > of vfio-pci-core rather than hooking directly into the core code.
>
> Personally, I think it is wrong layering for VFIO to be aware of KVM
> like this. This marks the first time that VFIO core code itself is
> being made aware of the KVM linkage.

I agree, but I've resigned that I've lost that battle. Both mdev vGPU
vendors make specific assumptions about running on a VM. VFIO was
never intended to be tied to KVM or the specific use case of a VM.

> It copies the same kind of design the s390 specific mdev use of
> putting VFIO in charge of KVM functionality. If we are doing this we
> should just give up and admit that KVM is a first-class part of struct
> vfio_device and get rid of the notifier stuff too, at least for s390.

Euw. You're right, I really don't like vfio core code embracing this
dependency for s390, device specific use cases are bad enough.

> Reading the patches and descriptions pretty much everything is boiling
> down to 'use vfio to tell the kvm architecture code to do something' -
> which I think needs to be handled through a KVM side ioctl.

AIF at least sounds a lot like the reason we invented the irq bypass
mechanism to allow interrupt producers and consumers to register
independently and associate to each other with a shared token.

Is the purpose of IOAT to associate the device to a set of KVM page
tables? That seems like a container or future iommufd operation. I
read DTSM as supported formats for the IOAT.

> Or, at the very least, everything needs to be described in some way
> that makes it clear what is happening to userspace, without kvm,
> through these ioctls.

As I understand the discussion here:

https://lore.kernel.org/all/[email protected]/

The assumption is that there is no non-KVM userspace currently. This
seems like a regression to me.

> This seems especially true now that it seems s390 PCI support is
> almost truely functional, with actual new userspace instructions to
> issue MMIO operations that work outside of KVM.
>
> I'm not sure how this all fits together, but I would expect an outcome
> where DPDK could run on these new systems and not have to know
> anything more about s390 beyond using the proper MMIO instructions via
> some compilation time enablement.

Yes, fully enabling zPCI with vfio, but only for KVM is not optimal.

> (I've been reviewing s390 patches updating rdma for a parallel set of
> stuff)
>
> > this is meant to extend vfio-pci proper for the whole arch. Is there a
> > compromise in using #ifdefs in vfio_pci_ops to call into zpci specific
> > code that implements these arch specific hooks and the core for
> > everything else? SPAPR code could probably converted similarly, it
> > exists here for legacy reasons. [Cc Jason]
>
> I'm not sure I get what you are suggesting? Where would these ifdefs
> be?

Essentially just:

static const struct vfio_device_ops vfio_pci_ops = {
.name = "vfio-pci",
#ifdef CONFIG_S390
.open_device = vfio_zpci_open_device,
.close_device = vfio_zpci_close_device,
.ioctl = vfio_zpci_ioctl,
#else
.open_device = vfio_pci_open_device,
.close_device = vfio_pci_core_close_device,
.ioctl = vfio_pci_core_ioctl,
#endif
.read = vfio_pci_core_read,
.write = vfio_pci_core_write,
.mmap = vfio_pci_core_mmap,
.request = vfio_pci_core_request,
.match = vfio_pci_core_match,
};

It would at least provide more validation/exercise of the core/vendor
split. Thanks,

Alex


2022-02-09 12:23:23

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On 2/8/22 3:40 PM, Jason Gunthorpe wrote:
> On Tue, Feb 08, 2022 at 03:33:58PM -0500, Matthew Rosato wrote:
>
>>> Is the purpose of IOAT to associate the device to a set of KVM page
>>> tables? That seems like a container or future iommufd operation. I
>>
>> Yes, here we are establishing a relationship with the DMA table in the guest
>> so that once mappings are established guest PCI operations (handled via
>> special instructions in s390) don't need to go through the host but can be
>> directly handled by firmware (so, effectively guest can keep running on its
>> vcpu vs breaking out).
>
> Oh, well, certainly sounds like a NAK on that - anything to do with
> the DMA translation of a PCI device must go through the iommu layer,
> not here.
>
> Lets not repeat the iommu subsytem bypass mess power made please.
>
>> It's more that non-KVM userspace doesn't care about what these ioctls are
>> doing... The enabling of 'interp, aif, ioat' is only pertinent when there
>> is a KVM userspace, specifically because the information being shared /
>> actions being performed as a result are only relevant to properly enabling
>> zPCI features when the zPCI device is being passed through to a VM
>> guest.
>
> Then why are they not KVM ioctls?

Well, the primary reason I ended up here was that I need to ensure that
the operation is only performed when guest X owns host zPCI device Y.
The vfio-pci device ioctl had the benefit of acting on device
granularity + also already being aware of the host PCI (and thus zPCI)
device association -- so I already know exactly what hostdev is being
referenced for the operation. All that was needed was the KVM notifier
to ensure the vfio device was associated to a KVM guest.

I think moving over to a KVM ioctl is doable; I might still need to rely
on VFIO_GROUP_NOTIFY_SET_KVM though, not sure yet.

Based on prior comments in this thread I'm assuming Alex shares that
view too (don't use vfio device ioctl for something only being used for
VM passthrough) so I'll start looking at using KVM ioctls instead.


2022-02-09 12:29:48

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, Feb 08, 2022 at 03:33:58PM -0500, Matthew Rosato wrote:

> > Is the purpose of IOAT to associate the device to a set of KVM page
> > tables? That seems like a container or future iommufd operation. I
>
> Yes, here we are establishing a relationship with the DMA table in the guest
> so that once mappings are established guest PCI operations (handled via
> special instructions in s390) don't need to go through the host but can be
> directly handled by firmware (so, effectively guest can keep running on its
> vcpu vs breaking out).

Oh, well, certainly sounds like a NAK on that - anything to do with
the DMA translation of a PCI device must go through the iommu layer,
not here.

Let's not repeat the iommu subsystem bypass mess POWER made, please.

> It's more that non-KVM userspace doesn't care about what these ioctls are
> doing... The enabling of 'interp, aif, ioat' is only pertinent when there
> is a KVM userspace, specifically because the information being shared /
> actions being performed as a result are only relevant to properly enabling
> zPCI features when the zPCI device is being passed through to a VM
> guest.

Then why are they not KVM ioctls?

Jason

2022-02-10 14:11:16

by Niklas Schnelle

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Tue, 2022-02-08 at 16:40 -0400, Jason Gunthorpe wrote:
> On Tue, Feb 08, 2022 at 03:33:58PM -0500, Matthew Rosato wrote:
>
> > > Is the purpose of IOAT to associate the device to a set of KVM page
> > > tables? That seems like a container or future iommufd operation. I
> >
> > Yes, here we are establishing a relationship with the DMA table in the guest
> > so that once mappings are established guest PCI operations (handled via
> > special instructions in s390) don't need to go through the host but can be
> > directly handled by firmware (so, effectively guest can keep running on its
> > vcpu vs breaking out).
>
> Oh, well, certainly sounds like a NAK on that - anything to do with
> the DMA translation of a PCI device must go through the iommu layer,
> not here.
>
> Lets not repeat the iommu subsytem bypass mess power made please.

Maybe some context on all of this. First, it's important to note that on
s390x the PCI IOMMU hardware is controlled with special instructions.
For pass-through this is actually quite nice, as it makes it relatively
simple for us to always run with an IOMMU in the guest; we simply need
to provide the instructions. This means we get full IOMMU protection for
pass-through devices on KVM guests, guests with pass-through remain
pageable, and we can even support nested pass-through.

This is possible with relatively little overhead because we can handle
all of the guest's per-map/unmap IOMMU operations with a single
instruction intercept. The instruction we need to intercept is called
Refresh PCI Translations (RPCIT). Its job is twofold.

For an OS running directly on our machine hypervisor (LPAR), it flushes
the IOMMU's TLB by informing it which pages have been invalidated, while
the hardware walks the page tables and fills the TLB on its own to
establish mappings for previously invalid IOVAs.

In a KVM or z/VM guest, the guest is informed that IOMMU translations
need to be refreshed even for previously invalid IOVAs. With this the
guest builds its IOMMU translation tables as normal, but then does an
RPCIT for the IOVA range it touched. In the hypervisor we can then
simply walk the translation tables, pin the guest pages and map them in
the host IOMMU. Prior to this series this happened in QEMU, which does
the map via vfio-iommu-type1 from user space. This works and will
remain as a fallback. Sadly it is quite slow and has a large impact on
performance, as we need to do a lot of mapping operations while the DMA
API of the guest goes through the virtual IOMMU. This series thus adds
the same functionality as a KVM intercept of RPCIT. I think this
neatly fits into KVM: we're emulating an instruction after all, and most
of its work is KVM-specific pinning of guest pages. Importantly, all
other handling, like IOMMU domain attachment, still goes through
vfio-iommu-type1 and we just fast-path the map/unmap operations.

In the code the map/unmap boils down to dma_walk_cpu_trans() and parts
of dma_shadow_cpu_trans(), both called from dma_table_shadow(). The
former is a function already shared between our DMA API and IOMMU API
implementations and is the only code that walks the host translation
tables. So, in a way, we are side-stepping the IOMMU API ops, that is
true, but we do not side-step the IOMMU host table access code paths.
Notice how our IOMMU API implementation is also < 400 LOC because the
DMA and IOMMU APIs share code.

That said, I believe we should still be able to do the mapping in a KVM
RPCIT intercept but go through the IOMMU API ops if this side-stepping
is truly unacceptable. It definitely adds overhead, though, and I'm not
sure what we gain in clarity or maintainability, since we already share
the actual host table access code and there is only one PCI IOMMU and
it is part of the architecture. Also, either KVM or QEMU needs to know
about the same details for looking at guest IOMMU translation tables /
emulating the guest IOMMU. It's also clear that the IOMMU API will
remain functional on its own, as it is necessary for any non-KVM use
case, which of course can't intercept RPCIT but on the other hand can
also keep mappings much longer, significantly reducing overhead.
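
As a rough illustration of the shadowing flow Niklas describes (not the
code from this series; the helper names read_guest_dma_entry(),
entry_is_valid(), entry_flags(), pin_guest_page(), map_host_iova() and
unmap_host_iova() are hypothetical stand-ins):

/* Sketch of the RPCIT intercept fast path: for each IOVA page in the
 * refreshed range, read the guest's DMA translation entry, pin the
 * backing guest page and mirror the mapping into the host IOMMU table;
 * invalid entries are treated as unmap/invalidation requests.
 */
static int shadow_rpcit_range(struct kvm *kvm, struct zpci_dev *zdev,
			      u64 iova, u64 len)
{
	u64 addr;

	for (addr = iova; addr < iova + len; addr += PAGE_SIZE) {
		u64 gentry, hpa;

		if (read_guest_dma_entry(kvm, zdev, addr, &gentry))
			return -EIO;
		if (!entry_is_valid(gentry)) {
			unmap_host_iova(zdev, addr);	/* TLB invalidation */
			continue;
		}
		if (pin_guest_page(kvm, gentry, &hpa))
			return -ENOMEM;		/* guest is told to retry */
		if (map_host_iova(zdev, addr, hpa, entry_flags(gentry)))
			return -EIO;
	}
	return 0;
}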


2022-02-10 15:58:36

by Niklas Schnelle

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Thu, 2022-02-10 at 09:01 -0400, Jason Gunthorpe wrote:
> On Thu, Feb 10, 2022 at 12:15:58PM +0100, Niklas Schnelle wrote:
>
> > In a KVM or z/VM guest the guest is informed that IOMMU translations
> > need to be refreshed even for previously invalid IOVAs. With this the
> > guest builds it's IOMMU translation tables as normal but then does a
> > RPCIT for the IOVA range it touched. In the hypervisor we can then
> > simply walk the translation tables, pin the guest pages and map them in
> > the host IOMMU. Prior to this series this happened in QEMU which does
> > the map via vfio-iommu-type1 from user-space. This works and will
> > remain as a fallback. Sadly it is quite slow and has a large impact on
> > performance as we need to do a lot of mapping operations as the DMA API
> > of the guest goes through the virtual IOMMU. This series thus adds the
> > same functionality but as a KVM intercept of RPCIT. Now I think this
> > neatly fits into KVM, we're emulating an instruction after all and most
> > of its work is KVM specific pinning of guest pages. Importantly all
> > other handling like IOMMU domain attachment still goes through vfio-
> > iommu-type1 and we just fast path the map/unmap operations.
>
> So you create an iommu_domain and then hand it over to kvm which then
> does map/unmap operations on it under the covers?

Yes

>
> How does the page pinning work?

The pinning is done directly in the RPCIT interception handler, which
pins both the guest IOMMU tables and the guest pages mapped for DMA.

>
> In the design we are trying to reach I would say this needs to be
> modeled as a special iommu_domain that has this automatic map/unmap
> behavior from following user pages. Creating it would specify the kvm
> and the in-guest base address of the guest's page table.

Makes sense.

> Then the
> magic kernel code you describe can operate on its own domain without
> becoming confused with a normal map/unmap domain.

This sounds like an interesting idea. Looking at
drivers/iommu/s390_iommu.c, most of that is pretty trivial domain
handling. I wonder if we could share this by marking the existing
s390_iommu_domain type with kind of a "lent out to KVM" flag, possibly
by simply having a non-NULL pointer to a struct holding the guest base
address, the kvm, etc.? That way we can share the setup/teardown of the
domain and of the host IOMMU tables, as well as the aperture checks,
while also being able to keep the IOMMU API from interfering with the
KVM RPCIT intercept and vice versa. I.e. while the domain is under
control of KVM's RPCIT handling we make all IOMMU map/unmap fail.

To me this more direct involvement of IOMMU and KVM on s390x is also a
direct consequence of it using special instructions. Naturally those
instructions can be intercepted or run under hardware accelerated
virtualization.

>
> It is like the HW nested translation other CPUs are doing, but instead
> of HW nested, it is SW nested.

Yes, very good analogy. Has any of that nested IOMMU translation work
been merged yet? I know AMD had something in the works, and nested
translations have been around for the MMU for a while and are also used
on s390x. We're definitely thinking about HW-nested IOMMU translations
too, so any design we come up with should be able to deal with that as
well. Basically, we would then execute RPCIT without leaving the
hardware virtualization mode (SIE). We believe that would require
pinning all of guest memory, though, because HW can't really pin pages.
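
The KVM-owned domain variant sketched in this sub-thread might look
roughly like the following; this is purely illustrative
(s390_kvm_iommu_domain and its fields are hypothetical, following the
suggestion of a domain created with the kvm and the guest table origin):

/* Sketch: a domain variant that carries the KVM linkage and the guest
 * DMA-table origin so the RPCIT intercept can drive map/unmap, while
 * ordinary iommu_map()/iommu_unmap() callers are kept out.
 */
struct s390_kvm_iommu_domain {
	struct iommu_domain	domain;		/* embeds the generic domain */
	struct kvm		*kvm;		/* owning guest */
	unsigned long		guest_iota;	/* guest DMA table origin/format */
	struct mutex		lock;		/* serializes RPCIT shadowing */
};

static inline struct s390_kvm_iommu_domain *
to_kvm_domain(struct iommu_domain *dom)
{
	return container_of(dom, struct s390_kvm_iommu_domain, domain);
}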


2022-02-10 21:42:17

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On 2/10/22 10:23 AM, Jason Gunthorpe wrote:
> On Thu, Feb 10, 2022 at 03:06:35PM +0100, Niklas Schnelle wrote:
>
>>> How does the page pinning work?
>>
>> The pinning is done directly in the RPCIT interception handler pinning
>> both the IOMMU tables and the guest pages mapped for DMA.
>
> And if pinning fails?

The RPCIT instruction goes back to the guest with an indication that
informs it the operation failed / gives it impetus to kick off a guest
DMA refresh and free up space (unpin).

>
>>> Then the
>>> magic kernel code you describe can operate on its own domain without
>>> becoming confused with a normal map/unmap domain.
>>
>> This sounds like an interesting idea. Looking at
>> drivers/iommu/s390_iommu.c most of that is pretty trivial domain
>> handling. I wonder if we could share this by marking the existing
>> s390_iommu_domain type with kind of a "lent out to KVM" flag.
>
> Lu has posted a series here:
>
> https://lore.kernel.org/linux-iommu/[email protected]
>
> Which allows the iommu driver to create a domain with unique ops, so
> you'd just fork the entire thing, have your own struct
> s390_kvm_iommu_domain and related ops.
>

OK, looking into this, thanks for the pointer... Sounds to me like we
then want to make the determination upfront and then ensure the right
iommu domain ops are registered for the device sometime before creation,
based upon the usecase -- general userspace: s390_iommu_ops (existing),
kvm: s390_kvm_iommu_domain (new).

> When the special creation flow is triggered you'd just create one of
> these with the proper ops already setup.
>
> We are imagining a special ioctl to create these things and each IOMMU
> HW driver can supply a unique implementation suited to their HW
> design.

But I haven't connected the dots on this part -- At the end of the day
for this 'special creation flow' I need the kvm + starting point of the
guest table + format before we let the new s390_kvm_iommu_domain start
doing automatic map/unmap during RPCIT intercept -- This initial setup
has to come from a special ioctl as you say, but where do you see it
living? I could certainly roll my own via a KVM ioctl or whatever, but
it sounds like you're also referring to a general-purpose ioctl to
encompass each of the different unique implementations, with this s390
kvm approach being one.


2022-02-10 23:29:48

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Thu, Feb 10, 2022 at 12:15:58PM +0100, Niklas Schnelle wrote:

> In a KVM or z/VM guest the guest is informed that IOMMU translations
> need to be refreshed even for previously invalid IOVAs. With this the
> guest builds its IOMMU translation tables as normal but then does a
> RPCIT for the IOVA range it touched. In the hypervisor we can then
> simply walk the translation tables, pin the guest pages and map them in
> the host IOMMU. Prior to this series this happened in QEMU which does
> the map via vfio-iommu-type1 from user-space. This works and will
> remain as a fallback. Sadly it is quite slow and has a large impact on
> performance as we need to do a lot of mapping operations as the DMA API
> of the guest goes through the virtual IOMMU. This series thus adds the
> same functionality but as a KVM intercept of RPCIT. Now I think this
> neatly fits into KVM: we're emulating an instruction after all, and most
> of its work is KVM-specific pinning of guest pages. Importantly all
> other handling like IOMMU domain attachment still goes through vfio-
> iommu-type1 and we just fast path the map/unmap operations.

So you create an iommu_domain and then hand it over to kvm which then
does map/unmap operations on it under the covers?

How does the page pinning work?

In the design we are trying to reach I would say this needs to be
modeled as a special iommu_domain that has this automatic map/unmap
behavior from following user pages. Creating it would specify the kvm
and the in-guest base address of the guest's page table. Then the
magic kernel code you describe can operate on its own domain without
becoming confused with a normal map/unmap domain.

It is like the HW nested translation other CPUs are doing, but instead
of HW nested, it is SW nested.

Jason

2022-02-11 00:08:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Thu, Feb 10, 2022 at 01:59:35PM -0500, Matthew Rosato wrote:

> OK, looking into this, thanks for the pointer... Sounds to me like we then
> want to make the determination upfront and then ensure the right iommu
> domain ops are registered for the device sometime before creation, based
> upon the usecase -- general userspace: s390_iommu_ops (existing), kvm:
> s390_kvm_iommu_domain (new).

Yes, that is the idea. I expect there will be many types of these
special iommu domains, e.g. Intel has talked about directly using the
KVM CPU page table as the IOMMU page table.

> > When the special creation flow is triggered you'd just create one of
> > these with the proper ops already setup.
> >
> > We are imagining a special ioctl to create these things and each IOMMU
> > HW driver can supply a unique implementation suited to their HW
> > design.
>
> But I haven't connected the dots on this part -- At the end of the day for
> this 'special creation flow' I need the kvm + starting point of the guest
> table + format before we let the new s390_kvm_iommu_domain start doing
> automatic map/unmap during RPCIT intercept -- This initial setup has to come
> from a special ioctl as you say, but where do you see it living? I could
> certainly roll my own via a KVM ioctl or whatever, but it sounds like you're
> also referring to a general-purpose ioctl to encompass each of the different
> unique implementations, with this s390 kvm approach being one.

So, the ioctl will need, as input, a kvm FD and an iommufd FD, plus
additional IOMMU-driver-specific data (format, starting point, etc).

The kvm supplies the context for the RPCIT to be captured in.

The result is the creation of an iommu_domain inside iommufd, done by
some iommu_ops->alloc_domain_xxxx() driver callback.

Which FD gets the ioctl is a bit of an aesthetic choice, but I
predict that iommufd makes the most sense since an object is being
created inside iommufd.
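
A minimal sketch of what such a creation ioctl payload might carry,
purely as an assumption about an eventual uapi (none of these names
exist today):

struct iommu_kvm_domain_alloc {
	__u32 size;		/* sizeof(struct iommu_kvm_domain_alloc) */
	__u32 flags;
	__s32 kvm_fd;		/* KVM context the RPCIT intercepts run under */
	__u32 __reserved;
	__aligned_u64 guest_table;	/* in-guest base address of the DMA table */
	__u32 format;			/* IOMMU-driver-specific table format */
	__u32 out_domain_id;		/* returned iommufd object handle */
};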

This flow is very similar to the 'userspace page table' flow others
are looking at, but has the extra twist that a KVM FD is needed to supply
the CPU page table.

It may overlap nicely with the Intel direction I mentioned. It is just
ugly layering-wise that KVM is getting shoved into platform code and
uapis all over the place, but I suppose that is unavoidable. And the
required loose coupling with the kvm module means all kinds of
symbol_get()s etc. :(

Jason



2022-02-11 04:57:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v3 24/30] vfio-pci/zdev: wire up group notifier

On Thu, Feb 10, 2022 at 03:06:35PM +0100, Niklas Schnelle wrote:

> > How does the page pinning work?
>
> The pinning is done directly in the RPCIT interception handler pinning
> both the IOMMU tables and the guest pages mapped for DMA.

And if pinning fails?

> > Then the
> > magic kernel code you describe can operate on its own domain without
> > becoming confused with a normal map/unmap domain.
>
> This sounds like an interesting idea. Looking at
> drivers/iommu/s390_iommu.c most of that is pretty trivial domain
> handling. I wonder if we could share this by marking the existing
> s390_iommu_domain type with kind of a "lent out to KVM" flag.

Lu has posted a series here:

https://lore.kernel.org/linux-iommu/[email protected]

Which allows the iommu driver to create a domain with unique ops, so
you'd just fork the entire thing, have your own struct
s390_kvm_iommu_domain and related ops.

When the special creation flow is triggered you'd just create one of
these with the proper ops already setup.

We are imagining a special ioctl to create these things and each IOMMU
HW driver can supply a unique implementation suited to their HW
design.
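
A rough sketch of what such a forked, KVM-only domain type could look
like (hypothetical names; not code from Lu's series or this one):

struct s390_kvm_iommu_domain {
	struct iommu_domain	domain;		/* generic IOMMU core view */
	struct kvm		*kvm;		/* guest driving the mappings */
	unsigned long		guest_table;	/* in-guest DMA table origin */
	u8			format;		/* guest table format */
};

/*
 * Its ops would cover attach/detach and free as usual but omit map/unmap
 * entirely: all updates flow through the KVM RPCIT intercept, so the
 * generic IOMMU API can never confuse or race with the KVM-managed domain.
 */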

> KVM RPCIT intercept and vice versa. I.e. while the domain is under
> control of KVM's RPCIT handling we make all IOMMU map/unmap fail.

It is not "under the control of" the domain would be created as linked
to kvm and would never, ever, be anything else.

> To me this more direct involvement of IOMMU and KVM on s390x is also a
> direct consequence of it using special instructions. Naturally those
> instructions can be intercepted or run under hardware accelerated
> virtualization.

Well, no, you've just created a kernel-side SW emulated nested
translation scheme. Other CPUs have talked about doing this too, but
nobody has attempted it.

You can make the same argument for any CPU's scheme: a trapped mmio
store is not fundamentally any different from a special instruction
that traps, other than how the information is transferred.

> Yes very good analogy. Has any of that nested IOMMU translations work
> been merged yet?

No. We are making quiet progress, slowly though. I'll add your
interest to my list.

> too. Basically we would then execute RPCIT without leaving the
> hardware virtualization mode (SIE). We believe that that would
> require pinning all of guest memory though because HW can't really
> pin pages.

Right, this is what other iommu HW will have to do.

Jason

2022-02-14 15:00:17

by Pierre Morel

[permalink] [raw]
Subject: Re: [PATCH v3 19/30] KVM: s390: pci: provide routines for enabling/disabling interpretation



On 2/4/22 22:15, Matthew Rosato wrote:
> These routines will be wired into the vfio_pci_zdev ioctl handlers to
> respond to requests to enable / disable a device for zPCI Load/Store
> interpretation.
>
> The first time such a request is received, enable the necessary facilities
> for the guest.
>
> Signed-off-by: Matthew Rosato <[email protected]>
> ---
> arch/s390/include/asm/kvm_pci.h | 4 ++
> arch/s390/kvm/pci.c | 102 ++++++++++++++++++++++++++++++++
> arch/s390/pci/pci.c | 3 +
> 3 files changed, 109 insertions(+)
>
> diff --git a/arch/s390/include/asm/kvm_pci.h b/arch/s390/include/asm/kvm_pci.h
> index ef10f9e46e37..422701d526dd 100644
> --- a/arch/s390/include/asm/kvm_pci.h
> +++ b/arch/s390/include/asm/kvm_pci.h
> @@ -24,4 +24,8 @@ struct kvm_zdev {
> int kvm_s390_pci_dev_open(struct zpci_dev *zdev);
> void kvm_s390_pci_dev_release(struct zpci_dev *zdev);
>
> +int kvm_s390_pci_interp_probe(struct zpci_dev *zdev);
> +int kvm_s390_pci_interp_enable(struct zpci_dev *zdev);
> +int kvm_s390_pci_interp_disable(struct zpci_dev *zdev);
> +
> #endif /* ASM_KVM_PCI_H */
> diff --git a/arch/s390/kvm/pci.c b/arch/s390/kvm/pci.c
> index 9b8390133e15..16bef3935284 100644
> --- a/arch/s390/kvm/pci.c
> +++ b/arch/s390/kvm/pci.c
> @@ -12,7 +12,9 @@
> #include <asm/kvm_pci.h>
> #include <asm/pci.h>
> #include <asm/pci_insn.h>
> +#include <asm/sclp.h>
> #include "pci.h"
> +#include "kvm-s390.h"
>
> struct zpci_aift *aift;
>
> @@ -153,6 +155,106 @@ int kvm_s390_pci_aen_init(u8 nisc)
> return rc;
> }
>
> +int kvm_s390_pci_interp_probe(struct zpci_dev *zdev)
> +{
> + /* Must have appropriate hardware facilities */
> + if (!sclp.has_zpci_lsi || !sclp.has_aisii || !sclp.has_aeni ||
> + !sclp.has_aisi || !test_facility(69) || !test_facility(70) ||
> + !test_facility(71) || !test_facility(72)) {
> + return -EINVAL;
> + }

I do not think we need to check the STFLE facilities: when the SCLP bit
indicating interpretation of a facility is installed, the STFLE bit
indicating the interpreted facility is also installed.


Otherwise, looks good to me, with this change:
Reviewed-by: Pierre Morel <[email protected]>
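
With that change, the probe check would reduce to something like the
following (a sketch of the suggested simplification, not the posted
patch):

	/* Must have appropriate hardware facilities */
	if (!sclp.has_zpci_lsi || !sclp.has_aisii || !sclp.has_aeni ||
	    !sclp.has_aisi)
		return -EINVAL;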

> +
> + /* Must have a KVM association registered */
> + if (!zdev->kzdev || !zdev->kzdev->kvm)
> + return -EINVAL;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_probe);
> +
> +int kvm_s390_pci_interp_enable(struct zpci_dev *zdev)
> +{
> + u32 gisa;
> + int rc;
> +
> + if (!zdev->kzdev || !zdev->kzdev->kvm)
> + return -EINVAL;
> +
> + /*
> + * If this is the first request to use an interpreted device, make the
> + * necessary vcpu changes
> + */
> + if (!zdev->kzdev->kvm->arch.use_zpci_interp)
> + kvm_s390_vcpu_pci_enable_interp(zdev->kzdev->kvm);
> +
> + /*
> + * In the event of a system reset in userspace, the GISA designation
> + * may still be assigned because the device is still enabled.
> + * Verify it's the same guest before proceeding.
> + */
> + gisa = (u32)virt_to_phys(&zdev->kzdev->kvm->arch.sie_page2->gisa);
> + if (zdev->gisa != 0 && zdev->gisa != gisa)
> + return -EPERM;
> +
> + if (zdev_enabled(zdev)) {
> + zdev->gisa = 0;
> + rc = zpci_disable_device(zdev);
> + if (rc)
> + return rc;
> + }
> +
> + /*
> + * Store information about the identity of the kvm guest allowed to
> + * access this device via interpretation to be used by host CLP
> + */
> + zdev->gisa = gisa;
> +
> + rc = zpci_enable_device(zdev);
> + if (rc)
> + goto err;
> +
> + /* Re-register the IOMMU that was already created */
> + rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> + virt_to_phys(zdev->dma_table));
> + if (rc)
> + goto err;
> +
> + return rc;
> +
> +err:
> + zdev->gisa = 0;
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_enable);
> +
> +int kvm_s390_pci_interp_disable(struct zpci_dev *zdev)
> +{
> + int rc;
> +
> + if (zdev->gisa == 0)
> + return -EINVAL;
> +
> + /* Remove the host CLP guest designation */
> + zdev->gisa = 0;
> +
> + if (zdev_enabled(zdev)) {
> + rc = zpci_disable_device(zdev);
> + if (rc)
> + return rc;
> + }
> +
> + rc = zpci_enable_device(zdev);
> + if (rc)
> + return rc;
> +
> + /* Re-register the IOMMU that was already created */
> + rc = zpci_register_ioat(zdev, 0, zdev->start_dma, zdev->end_dma,
> + virt_to_phys(zdev->dma_table));
> +
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(kvm_s390_pci_interp_disable);
> +
> int kvm_s390_pci_dev_open(struct zpci_dev *zdev)
> {
> struct kvm_zdev *kzdev;
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 13033717cd4e..5dbe49ec325e 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -147,6 +147,7 @@ int zpci_register_ioat(struct zpci_dev *zdev, u8 dmaas,
> zpci_dbg(3, "reg ioat fid:%x, cc:%d, status:%d\n", zdev->fid, cc, status);
> return cc;
> }
> +EXPORT_SYMBOL_GPL(zpci_register_ioat);
>
> /* Modify PCI: Unregister I/O address translation parameters */
> int zpci_unregister_ioat(struct zpci_dev *zdev, u8 dmaas)
> @@ -727,6 +728,7 @@ int zpci_enable_device(struct zpci_dev *zdev)
> zpci_update_fh(zdev, fh);
> return rc;
> }
> +EXPORT_SYMBOL_GPL(zpci_enable_device);
>
> int zpci_disable_device(struct zpci_dev *zdev)
> {
> @@ -750,6 +752,7 @@ int zpci_disable_device(struct zpci_dev *zdev)
> }
> return rc;
> }
> +EXPORT_SYMBOL_GPL(zpci_disable_device);
>
> /**
> * zpci_hot_reset_device - perform a reset of the given zPCI function
>

--
Pierre Morel
IBM Lab Boeblingen