2009-10-23 01:22:50

by Chris Wright

[permalink] [raw]
Subject: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

This short series gives us the ability to allocate the swiotlb and then
conditionally free it if we discover it isn't needed. This allows us to
put swiotlb to use when the hw iommu fails to initialize properly.

This needs some changes to the bootmem allocator to give the ability to
free reserved bootmem directly to the page allocator after bootmem is
torn down.

arch/x86/include/asm/swiotlb.h | 4 ++
arch/x86/kernel/pci-dma.c | 4 +-
arch/x86/kernel/pci-swiotlb.c | 27 +++++++++---
include/linux/bootmem.h | 1 +
include/linux/swiotlb.h | 3 +
lib/swiotlb.c | 10 ++++
mm/bootmem.c | 98 +++++++++++++++++++++++++++++++---------
7 files changed, 118 insertions(+), 29 deletions(-)

thanks,
-chris


2009-10-23 05:51:36

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

On Thu, 22 Oct 2009 18:21:58 -0700
Chris Wright <[email protected]> wrote:

> This short series gives us the ability to allocate the swiotlb and then
> conditionally free it if we discover it isn't needed. This allows us to
> put swiotlb to use when the hw iommu fails to initialize properly.
>
> This needs some changes to the bootmem allocator to give the ability to
> free reserved bootmem directly to the page allocator after bootmem is
> torn down.

The concept sounds fine but the third patch doesn't look correct.

Seems that the third patch doesn't take into account enabling both hw
iommu and swiotlb (Calgary does and I guess VT-d and AMD need that
too). Also (iommu_detected && !dma_ops) trick doesn't work for
Calgary, IIRC. The third patch also makes the dma startup code more
complicated.

I have half-baked patches to clean up the dma startup code supporting
the concept. I can work on the top of the first and second
patches. They need to be CC'ed to the memory people and ACKs, don't
they?


Thanks,

2009-10-23 16:39:55

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

* FUJITA Tomonori ([email protected]) wrote:
> On Thu, 22 Oct 2009 18:21:58 -0700
> Chris Wright <[email protected]> wrote:
>
> > This short series gives us the ability to allocate the swiotlb and then
> > conditionally free it if we discover it isn't needed. This allows us to
> > put swiotlb to use when the hw iommu fails to initialize properly.
> >
> > This needs some changes to the bootmem allocator to give the ability to
> > free reserved bootmem directly to the page allocator after bootmem is
> > torn down.
>
> The concept sounds fine but the third patch doesn't look correct.
>
> Seems that the third patch doesn't take into account enabling both hw
> iommu and swiotlb (Calgary does and I guess VT-d and AMD need that
> too).

VT-d isn't using swiotlb. Nor is AMD, although I think it will pick up
no_iommu on passthrough (seems like it would benefit from swiotlb in
that case).

> Also (iommu_detected && !dma_ops) trick doesn't work for
> Calgary, IIRC.

Yes, I think you are right. I had stared at the calgary code and thought
it would DTRT due to calgary's use of no_iommu as fallback, but instead
it will never pick up swiotlb_dma_ops. The calgary shouldn't even need
to be manually setting up nommu_dma_ops.

> The third patch also makes the dma startup code more
> complicated.

I completely agree. The whole dma/iommu startup is already complex
and fragile. Issues like above made getting the right combination
like whack-a-mole.

> I have half-baked patches to clean up the dma startup code supporting
> the concept. I can work on the top of the first and second
> patches. They need to be CC'ed to the memory people and ACKs, don't
> they?

Great.

2009-10-24 03:07:53

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

On Fri, 23 Oct 2009 09:39:23 -0700
Chris Wright <[email protected]> wrote:

> > The concept sounds fine but the third patch doesn't look correct.
> >
> > Seems that the third patch doesn't take into account enabling both hw
> > iommu and swiotlb (Calgary does and I guess VT-d and AMD need that
> > too).
>
> VT-d isn't using swiotlb. Nor is AMD, although I think it will pick up
> no_iommu on passthrough (seems like it would benefit from swiotlb in
> that case).

I think that they need swiotlb for the same reason why Calgray needs
it. IIRC, someone in VT-d camp said so.


> > Also (iommu_detected && !dma_ops) trick doesn't work for
> > Calgary, IIRC.
>
> Yes, I think you are right. I had stared at the calgary code and thought
> it would DTRT due to calgary's use of no_iommu as fallback, but instead
> it will never pick up swiotlb_dma_ops.

Note that Calgary comment 'falling back to no_iommu' is misleading. It
actually falls back to swiotlb or nommu properly.

Calgary doesn't set to dma_ops to calgary_dma_ops so it doesn't need
to pick up swiotlb_dma_ops.


> The calgary shouldn't even need to be manually setting up
> nommu_dma_ops.

Yeah, but it needs because of how the dma startup code works.

2009-10-24 06:57:31

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

* FUJITA Tomonori ([email protected]) wrote:
> On Fri, 23 Oct 2009 09:39:23 -0700
> Chris Wright <[email protected]> wrote:
>
> > > The concept sounds fine but the third patch doesn't look correct.
> > >
> > > Seems that the third patch doesn't take into account enabling both hw
> > > iommu and swiotlb (Calgary does and I guess VT-d and AMD need that
> > > too).
> >
> > VT-d isn't using swiotlb. Nor is AMD, although I think it will pick up
> > no_iommu on passthrough (seems like it would benefit from swiotlb in
> > that case).
>
> I think that they need swiotlb for the same reason why Calgray needs
> it. IIRC, someone in VT-d camp said so.

Right, it was used as fallback for a bit when pass through mode was
first enabled to handle case where device has dma mask smaller than
physical memory. That was removed, it was just recently w/ Alex's
comments that the idea of putting it back came up.

> > > Also (iommu_detected && !dma_ops) trick doesn't work for
> > > Calgary, IIRC.
> >
> > Yes, I think you are right. I had stared at the calgary code and thought
> > it would DTRT due to calgary's use of no_iommu as fallback, but instead
> > it will never pick up swiotlb_dma_ops.
>
> Note that Calgary comment 'falling back to no_iommu' is misleading. It
> actually falls back to swiotlb or nommu properly.
>
> Calgary doesn't set to dma_ops to calgary_dma_ops so it doesn't need
> to pick up swiotlb_dma_ops.

It does need swiotlb_dma_ops even when calgary init succeeds for the devices
that aren't behind Calgary to deal w/ the case of those devices having
dma mask smaller than physical memory (i.e those that don't get device
specific dma_ops set to calgary_dma_ops). So, the problem with this
patch is that it would fall back to nommu_dma_ops in cases where it was
expecting to fall back to swiotlb.

> > The calgary shouldn't even need to be manually setting up
> > nommu_dma_ops.
>
> Yeah, but it needs because of how the dma startup code works.

Be much better to have the core handle all of this. Basically register
ops and then put the top one on the stack to actual use. So the
fallback would be automatic, just pick the top of the stack and go.
Were you thinking of something along those lines?

thanks,
-chris

2009-10-26 07:10:28

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

Chris Wright <[email protected]> writes:

> This short series gives us the ability to allocate the swiotlb and then
> conditionally free it if we discover it isn't needed. This allows us to
> put swiotlb to use when the hw iommu fails to initialize properly.
>
> This needs some changes to the bootmem allocator to give the ability to
> free reserved bootmem directly to the page allocator after bootmem is
> torn down.

You forgot to state what motivated you to that change?

-Andi

--
[email protected] -- Speaking for myself only.

2009-10-26 16:27:04

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

* Andi Kleen ([email protected]) wrote:
> Chris Wright <[email protected]> writes:
>
> > This short series gives us the ability to allocate the swiotlb and then
> > conditionally free it if we discover it isn't needed. This allows us to
> > put swiotlb to use when the hw iommu fails to initialize properly.
> >
> > This needs some changes to the bootmem allocator to give the ability to
> > free reserved bootmem directly to the page allocator after bootmem is
> > torn down.
>
> You forgot to state what motivated you to that change?

I thought I did ;-) Here's another more verbose attempt:

The HW IOMMU, for example Intel VT-d, may fail initialization (typically
due to BIOS bugs). In that case the existing fallback is nommu, which is
clearly insufficient for many boxen which need some bounce buffering if
there is no HW IOMMU. The problem is that at the point of this failure
the decision to allocate and initialize the swiotlb has long since past.

There are 4 ways to handle this:

1) Give up and panic the box. This is not a user friendly option since
the box will boot and function fine (minus any isolation gained from the
HW IOMMU) if there is either not much phys mem or an swiotlb.

2) Do the discovery that causes the initialization failure earlier so
that HW IOMMU detection fails. Compilcated by the HW IOMMU's use of the
run time env that includes a real allocator and PCI enumeration, etc.

3) Allow the swiotlb to be allocated later in pci_iommu_init() instead
of early in pci_iommu_alloc(), IOW don't use bootmem for the swiotlb.
This is possible, although it will hit 2 limitations. The first is any
possible fragmentation that limits the availability of a 64M region
by the time this runs. The second is that x86 has MAX_ORDER of 11,
so at most we can allocate a 4M region from the page allocator which is
inusufficient for swiotlb.

4) Allow the swiotlb to be allocated in pci_iommu_alloc(), but not
initialized until pci_iommu_init(). This allows using bootmem allocator
to reserve a nice large contiguous chunk of memory, but requires some
way to give that memory back in the case that a HW IOMMU is properly
both detected and initialized (else it'd be a 64M memory leak for
effectively all HW IOMMU enabled boxen).

This series implements the fourth option.

thanks,
-chris

2009-10-26 21:56:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

[... 4 options described ...]
>
> This series implements the fourth option.

Ok makes sense.

This should be in the git log somewhere.

-Andi

--
[email protected] -- Speaking for myself only.

2009-10-28 06:54:24

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] allow fallback to swiotlb on hw iommu init failures

On Fri, 23 Oct 2009 23:57:12 -0700
Chris Wright <[email protected]> wrote:

> > Note that Calgary comment 'falling back to no_iommu' is misleading. It
> > actually falls back to swiotlb or nommu properly.
> >
> > Calgary doesn't set to dma_ops to calgary_dma_ops so it doesn't need
> > to pick up swiotlb_dma_ops.
>
> It does need swiotlb_dma_ops even when calgary init succeeds for the devices
> that aren't behind Calgary to deal w/ the case of those devices having
> dma mask smaller than physical memory (i.e those that don't get device
> specific dma_ops set to calgary_dma_ops).

I know since I wrote that code.


> > > The calgary shouldn't even need to be manually setting up
> > > nommu_dma_ops.
> >
> > Yeah, but it needs because of how the dma startup code works.
>
> Be much better to have the core handle all of this. Basically register
> ops and then put the top one on the stack to actual use. So the
> fallback would be automatic, just pick the top of the stack and go.
> Were you thinking of something along those lines?

I don't know what 'register ops and then put the top one on the stack'
means but it sounds overdoing.

I think that we can handle this issue more simply. I've just posted
the patchset.