2010-07-20 22:15:00

by Alok Kataria

Subject: Re: swiotlb detection should be memory hotplug aware ?

Hi,

Reviving a 4 month old thread.
I am still waiting for any clues on this question below.

>> 2. Instead of checking the max_pfn value in pci_swiotlb_detect, check
>> for max_hotpluggable_pfn (or some such) value. Though I don't see such a
>> value readily available. I could parse the SRAT and get hotplug memory
>> information but that will make swiotlb detection logic a little too
>> complex. A quick look around srat_xx.c files and the acpi_memhotplug
>> module didn't find any useful API that could be used directly either.
>> So was wondering if any of you are aware of an easy way to get such
>> information ?

Thanks,
Alok

On Wed, 2010-03-17 at 15:48 -0700, Alok Kataria wrote:
> On Tue, 2010-03-16 at 05:45 -0700, Konrad Rzeszutek Wilk wrote:
> > On Tue, Mar 16, 2010 at 10:33:20AM +0900, FUJITA Tomonori wrote:
> > > On Mon, 15 Mar 2010 20:51:40 -0400
> > > Konrad Rzeszutek Wilk <[email protected]> wrote:
> > >
> > > > On Fri, Mar 12, 2010 at 07:09:41PM -0800, Andi Kleen wrote:
> > > > > , Alok Kataria wrote:
> > > > >
> > > > > Hi Alok,
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> Looking at the current code, swiotlb is initialized for 64bit kernels
> > > > >> only when the max_pfn value is greater than 4G (the MAX_DMA32_PFN value).
> > > > >> So in cases when the initial memory is less than 4GB, the kernel boots
> > > > >> without enabling swiotlb; when we hot-add memory to such a kernel and go
> > > > >> beyond the 4G limit, swiotlb is still disabled. As a result, when any
> > > > >> 32bit devices start using this newly added memory beyond 4G, the kernel
> > > > >> starts spitting error messages like the ones below, or in some cases it
> > > > >> causes kernel panics.
> > > > >
> > > > > Yes seems like a real problem.
> > > > >
> > > > >>
> > > > >> 1. Enable swiotlb for all 64bit kernels which have memory hot-add
> > > > >> support.
> > > > >
> > > > > I don't think that's a good idea. It would enable it everywhere on
> > > > > distributions which compile with hotadd. Need (2)
> > > > >
> > > > >> 2. Instead of checking the max_pfn value in pci_swiotlb_detect, check
> > > > >> for max_hotpluggable_pfn (or some such) value. Though I don't see such a
> > > > >> value readily available. I could parse the SRAT and get hotplug memory
> > > > >> information but that will make swiotlb detection logic a little too
> > > > >> complex. A quick look around srat_xx.c files and the acpi_memhotplug
> > > > >> module didn't find any useful API that could be used directly either.
> > > > >> So was wondering if any of you are aware of an easy way to get such
> > > > >> information ?
> > > > >
> > > > > I have a patchkit to revamp the SRAT parsing to store the hotadd information
> > > >
>
> Andi...ping any pointers to the patchkit.

> > > > There is a late mechanism to kick off the SWIOTLB. Perhaps the hot-add
> > > > could use swiotlb_init_late and start up the SWIOTLB?
>
> I don't see why we need to do this via late_init; swiotlb detection,
> which happens through pci_swiotlb_detect, is already late enough that
> the SRAT has already been parsed. Or am I missing something?
> > >
> > > I guess that you are talking about
> > > swiotlb_late_init_with_default_size(), which IA64 uses. However, you
> > > can use swiotlb_late_init_with_default_size() only before we
> > > initialize devices. Making it work after initializing devices is not
> > > so easy, I think (that is, we need to change dma_ops).
>
> > That is a good point. Especially if we have some outstanding DMA pages
> > allocated via dma_alloc_coherent.
> >
> > I thought that the machines that have hot-add memory have their own
> > fancy IOMMU. For example, the IBM x3955 (and its family) utilize the
> > Calgary IOMMU. The HP boxes utilize the Intel VT-d (or the AMD
> > equivalent).
> > So is this mostly specialized to virtualized guests? (Xen PV guests
> > with PCI passthrough suffer the same problem, btw).
>
>
> I am assuming that there were Intel-based servers which supported memory
> hot-add before VT-d too. So, IMO this is not specialized to
> virtualization, though it might be hard to prove whether there are actual
> physical machines out there which have similar constraints (no HW IOMMU +
> memory hot-add support).
>
> Thanks,
> Alok


2010-07-21 05:03:35

by FUJITA Tomonori

Subject: Re: swiotlb detection should be memory hotplug aware ?

On Tue, 20 Jul 2010 15:14:57 -0700
Alok Kataria <[email protected]> wrote:

> Reviving a 4 month old thread.
> I am still waiting for any clues on this question below.

Basically, you want to add hot-plug memory and enable swiotlb, right?

We can't start swiotlb reliably after a system starts.

See dma32_reserve_bootmem() and dma32_free_bootmem(). What we do is
reserve huge memory in the DMA32 zone for swiotlb and release it if we
find that we don't need swiotlb. We can't find enough memory for
swiotlb in DMA32 after a system starts.
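
In rough pseudo-code, the boot-time pattern looks like this (a
much-simplified sketch; the real functions in arch/x86/kernel/pci-dma.c
also choose an alignment and allocation goal and pick their own sizing):

static void *dma32_bootmem_ptr;
static unsigned long dma32_bootmem_size __initdata = 128UL << 20;

void __init dma32_reserve_bootmem(void)
{
        if (max_pfn <= MAX_DMA32_PFN)
                return;
        /* Grab a big contiguous chunk early, before other boot-time
         * allocations can eat the low memory swiotlb would need. */
        dma32_bootmem_ptr = __alloc_bootmem_nopanic(dma32_bootmem_size,
                                                    PAGE_SIZE, 0);
        if (!dma32_bootmem_ptr)
                dma32_bootmem_size = 0;
}

void __init dma32_free_bootmem(void)
{
        /* Called just before the IOMMU/swiotlb setup runs: hand the
         * chunk back to bootmem so that setup (or anyone else, if no
         * swiotlb is needed) can allocate from it. */
        if (dma32_bootmem_ptr)
                free_bootmem(__pa(dma32_bootmem_ptr), dma32_bootmem_size);
        dma32_bootmem_ptr = NULL;
}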

2010-07-21 17:13:36

by Alok Kataria

Subject: Re: swiotlb detection should be memory hotplug aware ?


On Tue, 2010-07-20 at 21:58 -0700, FUJITA Tomonori wrote:
> On Tue, 20 Jul 2010 15:14:57 -0700
> Alok Kataria <[email protected]> wrote:
>
> > Reviving a 4 month old thread.
> > I am still waiting for any clues on this question below.
>
> Basically, you want to add hot-plug memory and enable swiotlb, right?

Not really, I am planning to do something like this,

@@ -52,7 +52,7 @@ int __init pci_swiotlb_detect(void)

/* don't initialize swiotlb if iommu=off (no_iommu=1) */
#ifdef CONFIG_X86_64
- if (!no_iommu && max_pfn > MAX_DMA32_PFN)
+ if (!no_iommu && (max_pfn > MAX_DMA32_PFN || hotplug_possible()))
swiotlb = 1;
#endif
if (swiotlb_force)

BUT, I don't know what that hotplug_possible() function would look
like, or whether such an interface already exists in the kernel (my
search didn't turn up any).

IMO, it should be possible to go read the SRAT to see if this system has
support for hotplug memory and then enable swiotlb if it does.
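
As a strawman, something like the sketch below might be enough,
assuming we simply remember the ACPI_SRAT_MEM_HOT_PLUGGABLE flag while
the SRAT memory affinity entries are parsed (srat_mem_hotpluggable and
hotplug_possible() are made-up names here, not existing kernel
interfaces):

/* Hypothetical helper, e.g. in arch/x86/mm/srat_64.c. */
static int srat_mem_hotpluggable __initdata;

/* Called from the existing SRAT memory-affinity parsing, per entry. */
static void __init srat_note_hotplug(struct acpi_srat_mem_affinity *ma)
{
        if (ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE)
                srat_mem_hotpluggable = 1;
}

int __init hotplug_possible(void)
{
        return srat_mem_hotpluggable;
}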

Sounds right ?

Thanks,
Alok

>
> We can't start swiotlb reliably after a system starts.
>
> See dma32_reserve_bootmem() and dma32_free_bootmem(). What we do is
> reserve huge memory in the DMA32 zone for swiotlb and release it if we
> find that we don't need swiotlb. We can't find enough memory for
> swiotlb in DMA32 after a system starts.

2010-07-21 23:44:57

by FUJITA Tomonori

Subject: Re: swiotlb detection should be memory hotplug aware ?

On Wed, 21 Jul 2010 10:13:34 -0700
Alok Kataria <[email protected]> wrote:

> > Basically, you want to add hot-plug memory and enable swiotlb, right?
>
> Not really, I am planning to do something like this,
>
> @@ -52,7 +52,7 @@ int __init pci_swiotlb_detect(void)
>
> /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> #ifdef CONFIG_X86_64
> - if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> + if (!no_iommu && (max_pfn > MAX_DMA32_PFN || hotplug_possible()))
> swiotlb = 1;

Always enable swiotlb with memory hotplug enabled? Wasting 64MB on an
x86_64 system with 128MB doesn't look like a good idea. I don't think
that there is an easy solution for this issue, though.

2010-07-22 00:04:15

by FUJITA Tomonori

Subject: Re: swiotlb detection should be memory hotplug aware ?

On Thu, 22 Jul 2010 08:44:42 +0900
FUJITA Tomonori <[email protected]> wrote:

> On Wed, 21 Jul 2010 10:13:34 -0700
> Alok Kataria <[email protected]> wrote:
>
> > > Basically, you want to add hot-plug memory and enable swiotlb, right?
> >
> > Not really, I am planning to do something like this,
> >
> > @@ -52,7 +52,7 @@ int __init pci_swiotlb_detect(void)
> >
> > /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> > #ifdef CONFIG_X86_64
> > - if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > + if (!no_iommu && (max_pfn > MAX_DMA32_PFN || hotplug_possible()))
> > swiotlb = 1;
>
> Always enable swiotlb with memory hotplug enabled? Wasting 64MB on an
> x86_64 system with 128MB doesn't look like a good idea. I don't think
> that there is an easy solution for this issue, though.

btw, you need more work to enable switching on the fly.

You need to change the dma_ops pointer (see get_dma_ops()). It means
that you need to track outstanding dma operations per device, locking,
etc.

2010-07-22 18:34:42

by Alok Kataria

Subject: Re: swiotlb detection should be memory hotplug aware ?

Hi,

On Wed, 2010-07-21 at 17:03 -0700, FUJITA Tomonori wrote:
> On Thu, 22 Jul 2010 08:44:42 +0900
> FUJITA Tomonori <[email protected]> wrote:
>
> > On Wed, 21 Jul 2010 10:13:34 -0700
> > Alok Kataria <[email protected]> wrote:
> >
> > > > Basically, you want to add hot-plug memory and enable swiotlb, right?
> > >
> > > Not really, I am planning to do something like this,
> > >
> > > @@ -52,7 +52,7 @@ int __init pci_swiotlb_detect(void)
> > >
> > > /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> > > #ifdef CONFIG_X86_64
> > > - if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > > + if (!no_iommu && (max_pfn > MAX_DMA32_PFN || hotplug_possible()))
> > > swiotlb = 1;
> >
> > Always enable swiotlb with memory hotplug enabled?

Yep, though only on systems which have hotpluggable memory support.

> > Wasting 64MB on an
> > x86_64 system with 128MB doesn't look like a good idea. I don't think
> > that there is an easy solution for this issue, though.

Good, now that you agree that that's the only feasible solution, do you
have any suggestions for any interfaces available from SRAT for
implementing hotplug_possible?

>
> btw, you need more work to enable switching on the fly.
>
> You need to change the dma_ops pointer (see get_dma_ops()). It means
> that you need to track outstanding dma operations per device, locking,
> etc.

Yeah, though if we are doing this at swiotlb_init time, i.e. at bootup,
as suggested in the pseudo-patch, we don't need to worry about all this,
right?

Thanks,
Alok

2010-07-23 14:24:30

by Konrad Rzeszutek Wilk

Subject: Re: swiotlb detection should be memory hotplug aware ?

On Thu, Jul 22, 2010 at 11:34:40AM -0700, Alok Kataria wrote:
> Hi,
>
> On Wed, 2010-07-21 at 17:03 -0700, FUJITA Tomonori wrote:
> > On Thu, 22 Jul 2010 08:44:42 +0900
> > FUJITA Tomonori <[email protected]> wrote:
> >
> > > On Wed, 21 Jul 2010 10:13:34 -0700
> > > Alok Kataria <[email protected]> wrote:
> > >
> > > > > Basically, you want to add hot-plug memory and enable swiotlb, right?
> > > >
> > > > Not really, I am planning to do something like this,
> > > >
> > > > @@ -52,7 +52,7 @@ int __init pci_swiotlb_detect(void)
> > > >
> > > > /* don't initialize swiotlb if iommu=off (no_iommu=1) */
> > > > #ifdef CONFIG_X86_64
> > > > - if (!no_iommu && max_pfn > MAX_DMA32_PFN)
> > > > + if (!no_iommu && (max_pfn > MAX_DMA32_PFN || hotplug_possible()))
> > > > swiotlb = 1;
> > >
> > > Always enable swiotlb with memory hotplug enabled?
>
> Yep, though only on systems which have hotpluggable memory support.

What machines are there that have hotplug support and no hardware IOMMU?
I know of the IBM ones - but they use the Calgary IOMMU.
>
> > > Wasting 64MB on an
> > > x86_64 system with 128MB doesn't look like a good idea. I don't think
> > > that there is an easy solution for this issue, though.
>
> Good, now that you agree that that's the only feasible solution, do you
> have any suggestions for any interfaces available from SRAT for
> implementing hotplug_possible?

I thought SRAT has NUMA affinity information - so for example my AMD
desktop box has that, but it does not support hotplug capability.

I think first your 'hotplug_possible' code needs to be more specific -
not just check if SRAT exists, but also if there are swaths of memory
that are non-populated. It would also help if there was some indication
of whether the box truly does a hardware hotplug - is there a way to do
this?
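
For example, the srat_note_hotplug() strawman from earlier in the
thread could be tightened so it only counts ranges which are flagged
hot-pluggable and reach beyond the memory present at boot (again only a
sketch, not existing code):

static void __init srat_note_hotplug(struct acpi_srat_mem_affinity *ma)
{
        u64 end = ma->base_address + ma->length;

        /* Only count ranges that are declared hot-pluggable and are not
         * already covered by the memory populated at boot. */
        if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) &&
            end > ((u64)max_pfn << PAGE_SHIFT))
                srat_mem_hotpluggable = 1;
}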

>
> >
> > btw, you need more work to enable switching on the fly.
> >
> > You need to change the dma_ops pointer (see get_dma_ops()). It means
> > that you need to track outstanding dma operations per device, locking,
> > etc.
>
> Yeah, though if we are doing this at swiotlb_init time, i.e. at bootup,
> as suggested in the pseudo-patch, we don't need to worry about all this,
> right?

Right..
>
> Thanks,
> Alok

2010-07-23 14:33:39

by Andi Kleen

Subject: Re: swiotlb detection should be memory hotplug aware ?


> I thought SRAT has NUMA affinity information - so for example my AMD
> desktop box has that, but it does not support hotplug capability.
>
> I think first your 'hotplug_possible' code needs to be more specific -
> not just check if SRAT exists, but also if there are swaths of memory
> that are non-populated. It would also help if there was some indication
> of whether the box truly does a hardware hotplug - is there a way to do
> this?

The SRAT declares hotplug memory ranges in advance. And Linux already
uses this information in the SRAT parser (just the code for doing this
is a bit dumb; I have a rewrite somewhere).

The only drawback is that some older systems claimed to have large
hotplug memory ranges when they didn't actually support it. So it's
better to not do anything with a lot of overhead.

So yes, it would be reasonable to let swiotlb (and possibly other code
sizing itself based on memory) call into the SRAT parser and check the
hotplug ranges too.

BTW longer term swiotlb should be really more dynamic anyways and grow
and shrink on demand. I attempted this some time ago with my DMA
allocator patchkit, unfortunately that didn't go forward.

-Andi

2010-07-23 15:00:13

by Konrad Rzeszutek Wilk

Subject: Re: swiotlb detection should be memory hotplug aware ?

On Fri, Jul 23, 2010 at 04:33:32PM +0200, Andi Kleen wrote:
>
> >I thought SRAT has NUMA affinity information - so for example my AMD
> >desktop box has that, but it does not support hotplug capability.
> >
> >I think first your 'hotplug_possible' code needs to be more specific -
> >not just check if SRAT exists, but also if there are swaths of memory
> >that are non-populated. It would also help if there was some indication
> >of whether the box truly does a hardware hotplug - is there a way to do
> >this?
>
> The SRAT declares hotplug memory ranges in advance. And Linux already
> uses this information in the SRAT parser (just the code for doing this
> is a bit dumb; I have a rewrite somewhere).
>
> The only drawback is that some older systems claimed to have large
> hotplug memory ranges when they didn't actually support it. So it's
> better to not do anything with a lot of overhead.
>
> So yes, it would be reasonable to let swiotlb (and possibly other code
> sizing itself based on memory) call into the SRAT parser and check the
> hotplug ranges too.
>
> BTW longer term swiotlb should be really more dynamic anyways and grow
> and shrink on demand. I attempted this some time ago with my DMA

I was thinking about this at some point. I think the first step is to
make SWIOTLB use the debugfs to actually print out how much of its
buffers are used - and see if the 64MB is a good fit.

The shrinking part scares me - I think it might be more prudent to first
explore how to grow it. The big problem looks to be allocating a
physically contiguous set of pages. And I guess SWIOTLB would need to
change from using one big region to something of a pool system?
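
For the debugfs part, even exporting a couple of counters would tell us
whether 64MB is in the right ballpark. A sketch - the swiotlb_slots_*
counters are hypothetical and would have to be maintained in the
swiotlb map/unmap paths:

#include <linux/debugfs.h>

/* Hypothetical counters, bumped in the lib/swiotlb.c map/unmap paths. */
static u64 swiotlb_slots_total;
static u64 swiotlb_slots_used;

static int __init swiotlb_debugfs_init(void)
{
        struct dentry *d = debugfs_create_dir("swiotlb", NULL);

        debugfs_create_u64("slots_total", 0444, d, &swiotlb_slots_total);
        debugfs_create_u64("slots_used", 0444, d, &swiotlb_slots_used);
        return 0;
}
late_initcall(swiotlb_debugfs_init);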

> allocator patchkit,
> unfortunately that didn't go forward.

I wasn't present at that time so I don't know what the issues were - you
wouldn't have a link to LKML for this?

2010-07-23 15:23:39

by Andi Kleen

Subject: Re: swiotlb detection should be memory hotplug aware ?


> I was thinking about this at some point. I think the first step is to
> make SWIOTLB use the debugfs to actually print out how much of its
> buffers are used - and see if the 64MB is a good fit.

swiotlb is nearly always wrongly sized. For most systems it's far too
much, but for some not enough. I have some systemtap scripts around to
instrument it.

Also it depends on the IO load, so if you size it reasonably you risk
overflow on large IO (however, these days this very rarely happens
because all "serious" IO devices don't need swiotlb anymore).

The other problem is that using only two bits for the needed address
space is also extremely inefficient (4GB and 16MB on x86). Really we
want masks everywhere and to optimize for the actual requirements.

> The shrinking part scares me - I think it might be more prudent to first
> explore how to grow it. The big problem looks to be allocating a
> physically contiguous set of pages. And I guess SWIOTLB would need to
> change from using one big region to something of a pool system?
>

Shrinking: you define a movable zone, so with some delay it can always
be freed.

The problem with swiotlb, however, is that it still cannot block, but it
can adapt to load.

The real fix would be a blockable swiotlb, but the way drivers are set
up this is difficult (at least in kernels using spinlocks).

>> allocator patchkit,
>> unfortunately that didn't go forward.
> I wasn't present at that time so I don't know what the issues were - you
> wouldn't have a link to LKML for this?


There wasn't all that much opposition, but I ran out of time because
fixing the infrastructure (e.g. getting rid of all of GFP_DMA) is a lot
of work. In a sense it's a big tree sweep project like getting rid of
BKL.

The old patch kit is at ftp://firstfloor.org/pub/ak/dma/
"intro" has the rationale.

I have a slightly newer version of the SCSI & misc drivers patchkit
somewhere.

-Andi

2010-07-28 10:10:40

by FUJITA Tomonori

Subject: Re: swiotlb detection should be memory hotplug aware ?

On Fri, 23 Jul 2010 17:23:33 +0200
Andi Kleen <[email protected]> wrote:

>
> > I was thinking about this at some point. I think the first step is to
> > make SWIOTLB use the debugfs to actually print out how much of its
> > buffers are used - and see if the 64MB is a good fit.
>
> swiotlb is nearly always wrongly sized. For most systems it's far too
> much, but for some not enough. I have some systemtap scripts around to
> instrument it.

True, it's impossible to preallocate the best iotlb size statically.


> Also it depends on the IO load, so if you size it reasonably you risk
> overflow on large IO (however, these days this very rarely happens
> because all "serious" IO devices don't need swiotlb anymore).

Yeah, nowadays it's pointless to try to get good performance with
swiotlb.


> The other problem is that using only two bits for the needed address
> space is also extremely inefficient (4GB and 16MB on x86). Really we
> want masks everywhere and to optimize for the actual requirements.

swiotlb doesn't allocate GFP_DMA memory. It handles only GFP_DMA32.

swiotlb doesn't work for drivers with some odd DMA mask (non-32bit),
but we have lived with it, so I don't think that it's a big issue.


I think supporting expanding swiotlb dynamically is enough. The
default swiotlb size, 64MB, is too large for the majority.

I have a half-baked patch for it. I'll send it later.

2010-07-28 14:09:52

by Andi Kleen

Subject: Re: swiotlb detection should be memory hotplug aware ?

FUJITA Tomonori <[email protected]> writes:
>
>> The other problem is that using only two bits for the needed address
>> space is also extremely inefficient (4GB and 16MB on x86). Really we
>> want masks everywhere and to optimize for the actual requirements.
>
> swiotlb doesn't allocate GFP_DMA memory. It handles only GFP_DMA32.

I was lumping GFP_DMA and swiotlb together here. The
pci_alloc_consistent() function uses both interchangeably.
They are really effectively the same thing these days,
just separated by historical accident.


> I have a half-baked patch for it. I'll send it later.

The problem is still the *_map users, which usually cannot sleep,
and then it's difficult to grow.
For *_alloc it's relatively easy and to some extent already
implemented.

-Andi

--
[email protected] -- Speaking for myself only.

2010-07-28 14:21:35

by FUJITA Tomonori

Subject: Re: swiotlb detection should be memory hotplug aware ?

On Wed, 28 Jul 2010 13:09:57 +0200
Andi Kleen <[email protected]> wrote:

> FUJITA Tomonori <[email protected]> writes:
> >
> >> The other problem is that using only two bits for the needed address
> >> space is also extremely inefficient (4GB and 16MB on x86). Really we
> >> want masks everywhere and to optimize for the actual requirements.
> >
> > swiotlb doesn't allocate GFP_DMA memory. It handles only GFP_DMA32.
>
> I was lumping GFP_DMA and swiotlb together here. The
> pci_alloc_consistent() function uses both interchangeably.
> They are really effectively the same thing these days,
> just separated by historical accident.

Sorry, I meant ZONE_DMA.

You are talking about your dma mask allocation patchset, right?

I meant that swiotlb doesn't need to handle ZONE_DMA. It handles only
devices that can handle ZONE_DMA32.


> > I have a half-baked patch for it. I'll send it later.
>
> The problem is still the *_map users, which usually cannot sleep,
> and then it's difficult to grow.

Why can't we use GFP_NOWAIT?

My approach is starting small (like 4MB) and increasing the io_tlb by
chunks, such as 4MB at a time.
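
Roughly something like this - a sketch with made-up names, ignoring the
locking and the slot bookkeeping the real patch needs; each extra chunk
is allocated atomically from ZONE_DMA32 when the current pool runs dry:

#define IO_TLB_CHUNK_SIZE       (4UL << 20)     /* grow in 4MB steps */
#define IO_TLB_CHUNK_ORDER      get_order(IO_TLB_CHUNK_SIZE)

struct io_tlb_chunk {                           /* hypothetical */
        struct list_head list;
        char *start;
};
static LIST_HEAD(io_tlb_chunks);

/* Called from the map path when the existing slots are exhausted.
 * GFP_NOWAIT because map_single callers may hold spinlocks. */
static struct io_tlb_chunk *io_tlb_grow(void)
{
        struct io_tlb_chunk *chunk;

        chunk = kmalloc(sizeof(*chunk), GFP_NOWAIT);
        if (!chunk)
                return NULL;
        chunk->start = (char *)__get_free_pages(GFP_NOWAIT | GFP_DMA32,
                                                IO_TLB_CHUNK_ORDER);
        if (!chunk->start) {
                kfree(chunk);
                return NULL;    /* caller falls back to the existing pool */
        }
        list_add_tail(&chunk->list, &io_tlb_chunks);
        return chunk;
}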