2022-03-21 15:24:04

by Jean-Philippe Brucker

Subject: Re: [PATCH RFC 10/11] iommu: Make IOPF handling framework generic

Hi Kevin,

On Mon, Mar 21, 2022 at 08:09:36AM +0000, Tian, Kevin wrote:
> > From: Lu Baolu <[email protected]>
> > Sent: Sunday, March 20, 2022 2:40 PM
> >
> > The existing IOPF handling framework only handles the I/O page faults for
> > SVA. Given that we are able to link iommu domain with each I/O page fault,
> > we can now make the I/O page fault handling framework more general for
> > more types of page faults.
>
> "make ... generic" in subject line is kind of confusing. Reading this patch I
> think you really meant changing from per-device fault handling to per-domain
> fault handling. This is more accurate in concept since the fault is caused by
> the domain page table.

I tend to disagree with that last part. The fault is caused by a specific
device accessing shared page tables. We should keep that device
information throughout the fault handling, so that we can report it to the
driver when things go wrong. A process can have multiple threads bound to
different devices, they share the same mm so if the driver wanted to
signal a misbehaving thread, similarly to a SEGV on the CPU side, it would
need the device information to precisely report it to userspace.

Thanks,
Jean


2022-03-21 22:35:33

by Jason Gunthorpe

Subject: Re: [PATCH RFC 10/11] iommu: Make IOPF handling framework generic

On Mon, Mar 21, 2022 at 11:42:16AM +0000, Jean-Philippe Brucker wrote:

> I tend to disagree with that last part. The fault is caused by a specific
> device accessing shared page tables. We should keep that device
> information throughout the fault handling, so that we can report it to the
> driver when things go wrong.

SVA faults should never be reported to drivers??

> A process can have multiple threads bound to different devices, they
> share the same mm so if the driver wanted to signal a misbehaving
> thread, similarly to a SEGV on the CPU side, it would need the
> device information to precisely report it to userspace.

I'm not sure I understand this - we can't match DMAs to executing
CPUs. On fault we fail the DMA and let the process keep running or
SIGSEGV the whole thread group.

Jason

2022-03-22 01:26:26

by Tian, Kevin

Subject: RE: [PATCH RFC 10/11] iommu: Make IOPF handling framework generic

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Monday, March 21, 2022 7:42 PM
>
> Hi Kevin,
>
> On Mon, Mar 21, 2022 at 08:09:36AM +0000, Tian, Kevin wrote:
> > > From: Lu Baolu <[email protected]>
> > > Sent: Sunday, March 20, 2022 2:40 PM
> > >
> > > The existing IOPF handling framework only handles the I/O page faults for
> > > SVA. Given that we are able to link iommu domain with each I/O page
> fault,
> > > we can now make the I/O page fault handling framework more general
> for
> > > more types of page faults.
> >
> > "make ... generic" in subject line is kind of confusing. Reading this patch I
> > think you really meant changing from per-device fault handling to per-
> domain
> > fault handling. This is more accurate in concept since the fault is caused by
> > the domain page table.
>
> I tend to disagree with that last part. The fault is caused by a specific
> device accessing shared page tables. We should keep that device
> information throughout the fault handling, so that we can report it to the
> driver when things go wrong. A process can have multiple threads bound to
> different devices, they share the same mm so if the driver wanted to
> signal a misbehaving thread, similarly to a SEGV on the CPU side, it would
> need the device information to precisely report it to userspace.
>

iommu driver can include the device information in the fault data. But
in concept the IOPF should be reported per domain.

and I agree with Jason that at most we can send SEGV to the entire thread
group since there is no way to associate a DMA back to a thread which
initiates the DMA.

Thanks
Kevin


2022-03-22 05:45:25

by Baolu Lu

Subject: Re: [PATCH RFC 10/11] iommu: Make IOPF handling framework generic

On 2022/3/21 20:43, Jason Gunthorpe wrote:
> On Mon, Mar 21, 2022 at 11:42:16AM +0000, Jean-Philippe Brucker wrote:
>
>> I tend to disagree with that last part. The fault is caused by a specific
>> device accessing shared page tables. We should keep that device
>> information throughout the fault handling, so that we can report it to the
>> driver when things go wrong.
> SVA faults should never be reported to drivers??
>

When things go wrong, the corresponding response code will be sent
to the device through iommu_page_response(). The hardware should then
report the failure to the device driver and the device driver will
handle it in the device-specific way. There's no need to propagate the
I/O page faults to the device driver in any case. Do I understand it
right?
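
The flow described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `enum resp_code`, `struct device`, `handle_fault()` and `page_response()` are hypothetical stand-ins, not the kernel's types or the real iommu_page_response() signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical response codes, loosely modelled on a PRI-style reply. */
enum resp_code { RESP_SUCCESS, RESP_INVALID, RESP_FAILURE };

/* Hypothetical device model: it only ever sees the response code. */
struct device {
    int dma_errors; /* failures the device would report to its driver */
};

/* Stand-in for the fault handler: resolve the address or give up. */
static enum resp_code handle_fault(int addr_is_valid)
{
    return addr_is_valid ? RESP_SUCCESS : RESP_INVALID;
}

/* Stand-in for delivering the response; the driver never sees the fault. */
static void page_response(struct device *dev, enum resp_code code)
{
    assert(dev != NULL);
    if (code != RESP_SUCCESS)
        dev->dma_errors++; /* device fails the DMA and reports its own error */
}
```

In this model the device driver learns about the failure only through the device's own error reporting, which is the point being asked about.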

Best regards,
baolu

2022-03-22 11:02:43

by Jean-Philippe Brucker

Subject: Re: [PATCH RFC 10/11] iommu: Make IOPF handling framework generic

On Tue, Mar 22, 2022 at 01:00:08AM +0000, Tian, Kevin wrote:
> > From: Jean-Philippe Brucker <[email protected]>
> > Sent: Monday, March 21, 2022 7:42 PM
> >
> > Hi Kevin,
> >
> > On Mon, Mar 21, 2022 at 08:09:36AM +0000, Tian, Kevin wrote:
> > > > From: Lu Baolu <[email protected]>
> > > > Sent: Sunday, March 20, 2022 2:40 PM
> > > >
> > > > The existing IOPF handling framework only handles the I/O page faults for
> > > > SVA. Given that we are able to link iommu domain with each I/O page
> > fault,
> > > > we can now make the I/O page fault handling framework more general
> > for
> > > > more types of page faults.
> > >
> > > "make ... generic" in subject line is kind of confusing. Reading this patch I
> > > think you really meant changing from per-device fault handling to per-
> > domain
> > > fault handling. This is more accurate in concept since the fault is caused by
> > > the domain page table.
> >
> > I tend to disagree with that last part. The fault is caused by a specific
> > device accessing shared page tables. We should keep that device
> > information throughout the fault handling, so that we can report it to the
> > driver when things go wrong. A process can have multiple threads bound to
> > different devices, they share the same mm so if the driver wanted to
> > signal a misbehaving thread, similarly to a SEGV on the CPU side, it would
> > need the device information to precisely report it to userspace.
> >
>
> iommu driver can include the device information in the fault data. But
> in concept the IOPF should be reported per domain.

So I don't remember where we left off on that topic, what about fault
injection into guests? In that case device info is more than just
diagnostic, fault injection can't work without it. I think we talked about
passing a device cookie to userspace, just want to make sure.

> and I agree with Jason that at most we can send SEGV to the entire thread
> group since there is no way to associate a DMA back to a thread which
> initiates the DMA.

The point is providing the most accurate information to the device driver
for diagnostics and debugging. A process opens multiple queues to
different devices, then if one of the queues issues invalid DMA, the
driver won't even know which queue is broken if you only report the target
mm and not the source dev. I don't think we gain anything from discarding
the device information from the fault path.
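
A minimal sketch of the ambiguity, assuming the process uses the same PASID on both devices because they share one mm. `struct queue` and `find_faulting_queue()` are hypothetical names made up for this illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types, for illustration only; not the kernel's definitions. */
struct queue {
    int dev_id;     /* device that owns the queue */
    uint32_t pasid; /* PASID the queue uses for DMA */
};

/*
 * Find the queue a fault belongs to. With use_dev == 0 only the PASID
 * (that is, the mm) is compared, which is all an mm-only report allows.
 * Returns NULL when no queue matches or when the match is ambiguous.
 */
static struct queue *find_faulting_queue(struct queue *qs, size_t n,
                                         int dev_id, uint32_t pasid,
                                         int use_dev)
{
    struct queue *match = NULL;

    assert(qs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (qs[i].pasid != pasid)
            continue;
        if (use_dev && qs[i].dev_id != dev_id)
            continue;
        if (match)
            return NULL; /* more than one candidate: cannot tell which */
        match = &qs[i];
    }
    return match;
}
```

With two queues on devices 0 and 1 that both use PASID 42, a lookup by PASID alone returns NULL, while a lookup by (device, PASID) returns the single queue on the faulting device.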

Thanks,
Jean

2022-03-22 15:58:11

by Jason Gunthorpe

Subject: Re: [PATCH RFC 10/11] iommu: Make IOPF handling framework generic

On Tue, Mar 22, 2022 at 01:03:14PM +0800, Lu Baolu wrote:
> On 2022/3/21 20:43, Jason Gunthorpe wrote:
> > On Mon, Mar 21, 2022 at 11:42:16AM +0000, Jean-Philippe Brucker wrote:
> >
> > > I tend to disagree with that last part. The fault is caused by a specific
> > > device accessing shared page tables. We should keep that device
> > > information throughout the fault handling, so that we can report it to the
> > > driver when things go wrong.
> > SVA faults should never be reported to drivers??
> >
>
> When things go wrong, the corresponding response code will be sent
> to the device through iommu_page_response(). The hardware should then
> report the failure to the device driver and the device driver will
> handle it in the device-specific way. There's no need to propagate the
> I/O page faults to the device driver in any case. Do I understand it
> right?

Something like that, I would expect fault failure to be similar to
accessing something that is not in the iommu map. An Error TLP like
thing toward the device and whatever normal device-specific error
propagation happens.

SVA shouldn't require any special support in the using driver beyond
turning on PRI/ATS.

Jason

2022-03-22 18:10:10

by Jean-Philippe Brucker

Subject: Re: [PATCH RFC 10/11] iommu: Make IOPF handling framework generic

On Tue, Mar 22, 2022 at 01:03:14PM +0800, Lu Baolu wrote:
> On 2022/3/21 20:43, Jason Gunthorpe wrote:
> > On Mon, Mar 21, 2022 at 11:42:16AM +0000, Jean-Philippe Brucker wrote:
> >
> > > I tend to disagree with that last part. The fault is caused by a specific
> > > device accessing shared page tables. We should keep that device
> > > information throughout the fault handling, so that we can report it to the
> > > driver when things go wrong.
> > SVA faults should never be reported to drivers??
> >
>
> When things go wrong, the corresponding response code will be sent
> to the device through iommu_page_response(). The hardware should then
> report the failure to the device driver and the device driver will
> handle it in the device-specific way. There's no need to propagate the
> I/O page faults to the device driver in any case. Do I understand it
> right?

In theory yes, but devices don't necessarily have the ability to report
precise errors, and we may have more information than they report.

Thanks,
Jean

2022-03-22 18:33:06

by Tian, Kevin

Subject: RE: [PATCH RFC 10/11] iommu: Make IOPF handling framework generic

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Tuesday, March 22, 2022 6:06 PM
>
> On Tue, Mar 22, 2022 at 01:00:08AM +0000, Tian, Kevin wrote:
> > > From: Jean-Philippe Brucker <[email protected]>
> > > Sent: Monday, March 21, 2022 7:42 PM
> > >
> > > Hi Kevin,
> > >
> > > On Mon, Mar 21, 2022 at 08:09:36AM +0000, Tian, Kevin wrote:
> > > > > From: Lu Baolu <[email protected]>
> > > > > Sent: Sunday, March 20, 2022 2:40 PM
> > > > >
> > > > > The existing IOPF handling framework only handles the I/O page faults
> for
> > > > > > SVA. Given that we are able to link iommu domain with each I/O
> page
> > > fault,
> > > > > we can now make the I/O page fault handling framework more
> general
> > > for
> > > > > more types of page faults.
> > > >
> > > > "make ... generic" in subject line is kind of confusing. Reading this patch
> I
> > > > think you really meant changing from per-device fault handling to per-
> > > domain
> > > > fault handling. This is more accurate in concept since the fault is caused
> by
> > > > the domain page table.
> > >
> > > I tend to disagree with that last part. The fault is caused by a specific
> > > device accessing shared page tables. We should keep that device
> > > information throughout the fault handling, so that we can report it to the
> > > driver when things go wrong. A process can have multiple threads bound
> to
> > > different devices, they share the same mm so if the driver wanted to
> > > signal a misbehaving thread, similarly to a SEGV on the CPU side, it would
> > > need the device information to precisely report it to userspace.
> > >
> >
> > iommu driver can include the device information in the fault data. But
> > in concept the IOPF should be reported per domain.
>
> So I don't remember where we left off on that topic, what about fault
> injection into guests? In that case device info is more than just
> diagnostic, fault injection can't work without it. I think we talked about
> passing a device cookie to userspace, just want to make sure.
>
> > and I agree with Jason that at most we can send SEGV to the entire thread
> > group since there is no way to associate a DMA back to a thread which
> > initiates the DMA.
>
> The point is providing the most accurate information to the device driver
> for diagnostics and debugging. A process opens multiple queues to
> different devices, then if one of the queues issues invalid DMA, the
> driver won't even know which queue is broken if you only report the target
> mm and not the source dev. I don't think we gain anything from discarding
> the device information from the fault path.
>

In case I didn't make it clear, my point is only about whether the iommu
core reports IOPF through a per-domain handler or a per-device handler.
That design choice doesn't change what the fault data should include
(device, pasid, addr, etc.), i.e. it always includes all the information
provided by the iommu driver no matter how the fault is reported upwards.

e.g. with iommufd, it is iommufd that registers an IOPF handler per managed
domain and receives IOPF on those domains. If necessary, iommufd further
forwards the fault to userspace, including the device cookie, according to
the fault data.
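
A rough userspace sketch of that split, assuming a handler registered per domain that still receives the full fault record. `struct fault`, `struct domain`, `report_fault()` and `record_fault()` are hypothetical names for illustration, not the iommu core or iommufd API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fault record: everything the iommu driver knows. */
struct fault {
    int dev_id;
    uint32_t pasid;
    uint64_t addr;
};

struct domain;
typedef int (*fault_handler_t)(struct domain *d, const struct fault *f);

/* Hypothetical domain: one handler per domain, not per device. */
struct domain {
    fault_handler_t handler;
    int last_dev_id;    /* recorded by the example handler below */
    uint64_t last_addr;
};

/* Core dispatch: pick the handler by domain, pass fault data unchanged. */
static int report_fault(struct domain *d, const struct fault *f)
{
    assert(d != NULL && f != NULL);
    if (!d->handler)
        return -1; /* nobody registered for this domain */
    return d->handler(d, f);
}

/* Example handler, standing in for a domain owner such as iommufd. */
static int record_fault(struct domain *d, const struct fault *f)
{
    d->last_dev_id = f->dev_id; /* device info is still available here */
    d->last_addr = f->addr;
    return 0;
}
```

The handler is selected by domain, yet it can still read the device, pasid and address from the fault data, for example to derive a device cookie before forwarding.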

Thanks
Kevin