2023-09-27 16:28:06

by Niklas Schnelle

[permalink] [raw]
Subject: Re: [PATCH v12 0/6] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing

On Wed, 2023-09-27 at 16:31 +0200, Niklas Schnelle wrote:
> On Wed, 2023-09-27 at 15:20 +0200, Niklas Schnelle wrote:
> > On Wed, 2023-09-27 at 13:24 +0200, Niklas Schnelle wrote:
> > > On Wed, 2023-09-27 at 11:55 +0200, Joerg Roedel wrote:
> > > > Hi Niklas,
> > > >
> > > > On Wed, Sep 27, 2023 at 10:55:23AM +0200, Niklas Schnelle wrote:
> > > > > The problem is that something seems to be broken in the iommu/core
> > > > > branch. Regardless of whether I have my DMA API conversion on top or
> > > > > with the base iommu/core branch I can not use ConnectX-4 VFs.
> > > >
> > > > Have you already tried to bisect the issue in the iommu/core branch?
> > > > The result might sched some light on the issue.
> > > >
> > > > Regards,
> > > >
> > > > Joerg
> > >
> > > Hi Joerg,
> > >
> > > Working on it, somehow I must have messed up earlier. It now looks like
> > > it might in fact be caused by my DMA API conversion rebase and the
> > > "s390/pci: Use dma-iommu layer" commit. Maybe there is some interaction
> > > with Jason's patches that I haven't thought about. So sorry for any
> > > wrong blame.
> > >
> > > Thanks,
> > > Niklas
> >
> > Hi,
> >
> > I tracked the problem down from mlx5_core's alloc_cmd_page() via
> > dma_alloc_coherent(), ops->alloc, iommu_dma_alloc_remap(), and
> > __iommu_dma_alloc_noncontiguous() to a failed iommu_dma_alloc_iova().
> > The allocation here is for 4K so nothing crazy.
> >
> > On second look I also noticed:
> >
> > nvme 2007:00:00.0: Using 42-bit DMA addresses
> >
> > for the NVMe that is working. The problem here seems to be that we set
> > iommu_dma_forcedac = true in s390_iommu_probe_finalize() because we
> > have currently have a reserved region over the first 4 GiB anyway so
> > will always use IOVAs larger than that. That however is too late since
> > iommu_dma_set_pci_32bit_workaround() is already checked in
> > __iommu_probe_device() which is called just before ops-
> > > probe_finalize(). So I moved setting iommu_dma_forcedac = true to
> > zpci_init_iommu() and that gets rid of the notice for the NVMe but I
> > still get a failure of iommu_dma_alloc_iova() in
> > __iommu_dma_alloc_noncontiguous(). So I'll keep digging.
> >
> > Thanks,
> > Niklas
>
>
> Ok I think I got it and this doesn't seem strictly s390x specific but
> I'd think should happen with iommu.forcedac=1 everywhere.
>
> The reason iommu_dma_alloc_iova() fails seems to be that mlx5_core does
> dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)) in 
> mlx5_pci_init()->set_dma_caps() which happens after it already called
> mlx5_mdev_init()->mlx5_cmd_init()->alloc_cmd_page() so for the
> dma_alloc_coherent() in there the dev->coherent_dma_mask is still
> DMA_BIT_MASK(32) for which we can't find an IOVA because well we don't
> have IOVAs below 4 GiB. Not entirely sure what caused this not to be
> enforced before.
>
> Thanks,
> Niklas
>

Ok, another update. On trying it out again this problem actually also
occurs when applying this v12 on top of v6.6-rc3 too. Also I guess
unlike my prior thinking it probably doesn't occur with
iommu.forcedac=1 since that still allows IOVAs below 4 GiB and we might
be the only ones who don't support those. From my point of view this
sounds like a mlx5_core issue they really should call
dma_set_mask_and_coherent() before their first call to
dma_alloc_coherent() not after. So I guess I'll send a v13 of this
series rebased on iommu/core and with an additional mlx5 patch and then
let's hope we can get that merged in a way that doesn't leave us with
broken ConnectX VFs for too long.

Thanks,
Niklas


2023-09-27 19:32:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v12 0/6] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing

On Wed, Sep 27, 2023 at 05:24:20PM +0200, Niklas Schnelle wrote:

> Ok, another update. On trying it out again this problem actually also
> occurs when applying this v12 on top of v6.6-rc3 too. Also I guess
> unlike my prior thinking it probably doesn't occur with
> iommu.forcedac=1 since that still allows IOVAs below 4 GiB and we might
> be the only ones who don't support those. From my point of view this
> sounds like a mlx5_core issue they really should call
> dma_set_mask_and_coherent() before their first call to
> dma_alloc_coherent() not after. So I guess I'll send a v13 of this
> series rebased on iommu/core and with an additional mlx5 patch and then
> let's hope we can get that merged in a way that doesn't leave us with
> broken ConnectX VFs for too long.

Yes, OK. It definitely sounds wrong that mlx5 is doing dma allocations before
setting it's dma_set_mask_and_coherent(). Please link to this thread
and we can get Leon or Saeed to ack it for Joerg.

(though wondering why s390 is the only case that ever hit this?)

Jason

2023-09-28 02:56:59

by Matthew Rosato

[permalink] [raw]
Subject: Re: [PATCH v12 0/6] iommu/dma: s390 DMA API conversion and optimized IOTLB flushing

On 9/27/23 11:40 AM, Jason Gunthorpe wrote:
> On Wed, Sep 27, 2023 at 05:24:20PM +0200, Niklas Schnelle wrote:
>
>> Ok, another update. On trying it out again this problem actually also
>> occurs when applying this v12 on top of v6.6-rc3 too. Also I guess
>> unlike my prior thinking it probably doesn't occur with
>> iommu.forcedac=1 since that still allows IOVAs below 4 GiB and we might
>> be the only ones who don't support those. From my point of view this
>> sounds like a mlx5_core issue they really should call
>> dma_set_mask_and_coherent() before their first call to
>> dma_alloc_coherent() not after. So I guess I'll send a v13 of this
>> series rebased on iommu/core and with an additional mlx5 patch and then
>> let's hope we can get that merged in a way that doesn't leave us with
>> broken ConnectX VFs for too long.
>
> Yes, OK. It definitely sounds wrong that mlx5 is doing dma allocations before
> setting it's dma_set_mask_and_coherent(). Please link to this thread
> and we can get Leon or Saeed to ack it for Joerg.
>

Hi Niklas,

I bisected the start of this issue to the following commit (only noticeable on s390 when you apply this subject series on top):

06cd555f73caec515a14d42ef052221fa2587ff9 ("net/mlx5: split mlx5_cmd_init() to probe and reload routines")

Which went in during the merge window. Please include with your fix and/or report to the mlx5 maintainers. Looks like the changes in this patch match what you and Jason describe; it splits up mlx5_cmd_init() and moves part of the call earlier. The net result is we first call mlx5_mdev_init>mlx5_cmd_init->alloc_cmd_page->dma_alloc_coherent and then sometime later call mlx5_pci_init->set_dma_caps->dma_set_mask_and_coherent.

Prior to this patch, we would not drive mlx5_cmd_init (and thus that first dma_alloc_coherent) until mlx5_init_one which happens _after_ mlx5_pci_init->set_dma_caps->dma_set_mask_and_coherent.

Thanks,
Matt