LinuxLists.cc - [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Wed, 06 Jun 2007 11:57:00 -0700
[email protected] wrote:

> Signed-off-by: Anil S Keshavamurthy <[email protected]>

That was a terse changelog.

Obvious question: how does this differ from mempools, and would it be
better to fill in any gaps in mempool functionality instead of
implementing something similar-looking?

The changelog very much should describe all this, as well as explaining
what the dynamic behaviour of this new thing is, and what applications are
envisaged, what problems it solves, etc, etc.

> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.22-rc3/lib/respool.c 2007-06-06 11:34:46.000000000 -0700

There are a number of coding-style glitches in here, but
scripts/checkpatch.pl catches most of them. Please run it, and fix.

> @@ -0,0 +1,222 @@
> +/*
> + * respool.c - library routines for handling generic pre-allocated pool of objects
> + *
> + * Copyright (c) 2006, Intel Corporation.
> + *
> + * This file is released under the GPLv2.
> + *
> + * Copyright (C) 2006 Anil S Keshavamurthy <[email protected]>
> + */
> +
> +#include <linux/respool.h>
> +
> +/**
> + * get_resource_pool_obj - gets an object from the pool
> + * @ppool - resource pool in question
> + * This function gets an object from the pool and
> + * if the pool count drops below min_count, this
> + * function schedules work to grow the pool. If
> + * no elements are fount in the pool then this function
> + * tries to get memory from kernel.
> + */
> +void * get_resource_pool_obj(struct resource_pool *ppool)
> +{
> + unsigned long flags;
> + struct list_head *plist;
> + bool queue_work = 0;
> +
> + spin_lock_irqsave(&ppool->pool_lock, flags);
> + if (!list_empty(&ppool->pool_head)) {
> + plist = ppool->pool_head.next;
> + list_del(plist);
> + ppool->curr_count--;
> + } else {
> + /*Making sure that curr_count is 0 when list is empty */
> + plist = NULL;
> + BUG_ON(ppool->curr_count != 0);
> + }
> +
> + /* Check if pool needs to grow */
> + if (ppool->curr_count <= ppool->min_count)
> + queue_work = 1;
> + spin_unlock_irqrestore(&ppool->pool_lock, flags);
> +
> + if (queue_work)
> + schedule_work(&ppool->work); /* queue work to grow the pool */
> +
> +
> + if (plist) {
> + memset(plist, 0, ppool->alloc_size); /* Zero out memory */
> + return plist;
> + }
> +
> + /* Out of luck, try to get memory from kernel */
> + plist = (struct list_head *)ppool->alloc_mem(ppool->alloc_size,
> + GFP_ATOMIC);
> +
> + return plist;
> +}

A function like this should take a gfp_t from the caller, and pass it on.

> +/**
> + * put_resource_pool_obj - puts an object back to the pool
> + * @vaddr - object's address
> + * @ppool - resource pool in question.
> + * This function puts an object back to the pool.
> + */
> +void put_resource_pool_obj(void * vaddr, struct resource_pool *ppool)
> +{
> + unsigned long flags;
> + struct list_head *plist = (struct list_head *)vaddr;
> + bool queue_work = 0;
> +
> + BUG_ON(!vaddr);
> + BUG_ON(!ppool);
> +
> + spin_lock_irqsave(&ppool->pool_lock, flags);
> + list_add(plist, &ppool->pool_head);
> + ppool->curr_count++;
> + if (ppool->curr_count > (ppool->min_count +
> + ppool->grow_count * 2))
> + queue_work = 1;

Some of the indenting is a bit funny-looking in here.

> + spin_unlock_irqrestore(&ppool->pool_lock, flags);
> +
> + if (queue_work)
> + schedule_work(&ppool->work); /* queue work to shrink the pool */
> +}
> +
> +void
> +__grow_resource_pool(struct resource_pool *ppool,
> + unsigned int grow_count)
> +{
> + unsigned long flags;
> + struct list_head *plist;
> +
> + while(grow_count) {
> + plist = (struct list_head *)ppool->alloc_mem(ppool->alloc_size,
> + GFP_KERNEL);

resource_pool.alloc_mem() already returns void *, so there is never a need
to cast its return value.

> + if (!plist)
> + break;
> +
> + /* Add the element to the list */
> + spin_lock_irqsave(&ppool->pool_lock, flags);
> + list_add(plist, &ppool->pool_head);
> + ppool->curr_count++;
> + spin_unlock_irqrestore(&ppool->pool_lock, flags);
> + grow_count--;
> + }
> +}
> +

2007-06-08 18:25:56

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Thu, Jun 07, 2007 at 04:27:26PM -0700, Andrew Morton wrote:
> On Wed, 06 Jun 2007 11:57:00 -0700
> [email protected] wrote:
>
> > Signed-off-by: Anil S Keshavamurthy <[email protected]>
>
> That was a terse changelog.
>
> Obvious question: how does this differ from mempools, and would it be
> better to fill in any gaps in mempool functionality instead of
> implementing something similar-looking?

Very good question. Mempool pre-allocates the elements
to the required minimum count size during its initilization time.
However when mempool_alloc() is called it tries to obtain the
element from OS and if that fails then it looks for the element in
its pool. If there are no elements in its pool and if the gpf_t
flags says it can wait then it waits untill someone puts the element
back to pool, else if gpf_t flag say it can;t wait then it returns NULL.
In other words, mempool acts as *emergency* pool, i.e only if the OS fails
to allocate the required memory, then the pool object is used.

In the IOMMU case, we need exactly opposite of what mempool provides,
i.e we always want to look for the element in the pool and if the pool
has no element then go to OS as a worst case. This resource pool
library routines do the same. Again, this resource pools
grows and shrinks automatically to maintain the minimum pool
elements in the background. I am not sure whether this totally
opposite functionality of mempools and resource pools can be
merged.

In fact the very first version of this IOMMU patch used mempools
and the performance was worse because mempool did not help as
IOMMU did a very frequent alloc and free of pool objects and
every call to alloc/free used to go to os. Andi Kleen,
noticied and told us that mempool usage for IOMMU is wrong and
hence we came up with resource pool concept.

>
> The changelog very much should describe all this, as well as explaining
> what the dynamic behaviour of this new thing is, and what applications are
> envisaged, what problems it solves, etc, etc.

I can gladly update the changelog if the resource pool concept is
approved. I will fix all the below minor comments.

I envision that this might be useful for all vendor's (IBM, AMD, Intel, etc) IOMMU driver
and for any kernel component which does lots of dynamic alloc/free an object of same size.

thanks,
Anil

2007-06-08 19:01:40

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007 11:21:57 -0700
"Keshavamurthy, Anil S" <[email protected]> wrote:

> On Thu, Jun 07, 2007 at 04:27:26PM -0700, Andrew Morton wrote:
> > On Wed, 06 Jun 2007 11:57:00 -0700
> > [email protected] wrote:
> >
> > > Signed-off-by: Anil S Keshavamurthy <[email protected]>
> >
> > That was a terse changelog.
> >
> > Obvious question: how does this differ from mempools, and would it be
> > better to fill in any gaps in mempool functionality instead of
> > implementing something similar-looking?
>
> Very good question. Mempool pre-allocates the elements
> to the required minimum count size during its initilization time.
> However when mempool_alloc() is called it tries to obtain the
> element from OS and if that fails then it looks for the element in
> its pool. If there are no elements in its pool and if the gpf_t
> flags says it can wait then it waits untill someone puts the element
> back to pool, else if gpf_t flag say it can;t wait then it returns NULL.
> In other words, mempool acts as *emergency* pool, i.e only if the OS fails
> to allocate the required memory, then the pool object is used.
>
>
> In the IOMMU case, we need exactly opposite of what mempool provides,
> i.e we always want to look for the element in the pool and if the pool
> has no element then go to OS as a worst case. This resource pool
> library routines do the same. Again, this resource pools
> grows and shrinks automatically to maintain the minimum pool
> elements in the background. I am not sure whether this totally
> opposite functionality of mempools and resource pools can be
> merged.

Confused.

If resource pools are not designed to provide extra robustness via an
emergency pool, then what _are_ they designed for? (Boy this is a hard way
to write a changelog!)

> In fact the very first version of this IOMMU patch used mempools
> and the performance was worse because mempool did not help as
> IOMMU did a very frequent alloc and free of pool objects and
> every call to alloc/free used to go to os. Andi Kleen,
> noticied and told us that mempool usage for IOMMU is wrong and
> hence we came up with resource pool concept.

You _seem_ to be saying that the resource pools are there purely for
alloc/free performance reasons. If so, I'd be skeptical: slab is pretty
darned fast.

> >
> > The changelog very much should describe all this, as well as explaining
> > what the dynamic behaviour of this new thing is, and what applications are
> > envisaged, what problems it solves, etc, etc.
>
> I can gladly update the changelog if the resource pool concept is
> approved. I will fix all the below minor comments.
>
> I envision that this might be useful for all vendor's (IBM, AMD, Intel, etc) IOMMU driver
> and for any kernel component which does lots of dynamic alloc/free an object of same size.
>

That's what kmem_cache_alloc() is for?!?!

2007-06-08 20:15:54

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, Jun 08, 2007 at 12:01:07PM -0700, Andrew Morton wrote:
> On Fri, 8 Jun 2007 11:21:57 -0700
> "Keshavamurthy, Anil S" <[email protected]> wrote:
>
> > On Thu, Jun 07, 2007 at 04:27:26PM -0700, Andrew Morton wrote:
> > > On Wed, 06 Jun 2007 11:57:00 -0700
> > > [email protected] wrote:
> > >
> > > > Signed-off-by: Anil S Keshavamurthy <[email protected]>
> > >
> > > That was a terse changelog.
> > >
> > > Obvious question: how does this differ from mempools, and would it be
> > > better to fill in any gaps in mempool functionality instead of
> > > implementing something similar-looking?
> >
> > Very good question. Mempool pre-allocates the elements
> > to the required minimum count size during its initilization time.
> > However when mempool_alloc() is called it tries to obtain the
> > element from OS and if that fails then it looks for the element in
> > its pool. If there are no elements in its pool and if the gpf_t
> > flags says it can wait then it waits untill someone puts the element
> > back to pool, else if gpf_t flag say it can;t wait then it returns NULL.
> > In other words, mempool acts as *emergency* pool, i.e only if the OS fails
> > to allocate the required memory, then the pool object is used.
> >
> >
> > In the IOMMU case, we need exactly opposite of what mempool provides,
> > i.e we always want to look for the element in the pool and if the pool
> > has no element then go to OS as a worst case. This resource pool
> > library routines do the same. Again, this resource pools
> > grows and shrinks automatically to maintain the minimum pool
> > elements in the background. I am not sure whether this totally
> > opposite functionality of mempools and resource pools can be
> > merged.
>
> Confused.
>
> If resource pools are not designed to provide extra robustness via an
> emergency pool, then what _are_ they designed for? (Boy this is a hard way
> to write a changelog!)

The resource pool indeed provide extra robustness, the initial pool size will
be equal to min_count + grow_count. If the pool object count goes below
min_count, then pool grows in the background while serving as emergency
pool with min_count of objects in it. If we run out of emergency pool objects
before the pool grow in the background, then we go to OS for allocation.

Similary, if the pool objects grows above the max threshold,
the objects are freed to OS in the background thread maintaining
the pool objects close to min_count + grow_count size.

>
> > In fact the very first version of this IOMMU patch used mempools
> > and the performance was worse because mempool did not help as
> > IOMMU did a very frequent alloc and free of pool objects and
> > every call to alloc/free used to go to os. Andi Kleen,
> > noticied and told us that mempool usage for IOMMU is wrong and
> > hence we came up with resource pool concept.
>
> You _seem_ to be saying that the resource pools are there purely for
> alloc/free performance reasons. If so, I'd be skeptical: slab is pretty
> darned fast.
We need several objects of size say( 4 * sizeof(u64)) and reuse
them in dma map/unmap api calls for managing io virtual allocation address that
this driver has dished out. Hence having pool of objects where we put
the element in the linked list and and get it from the linked list is pretty
fast compared to slab.
>
> > >
> > > The changelog very much should describe all this, as well as explaining
> > > what the dynamic behaviour of this new thing is, and what applications are
> > > envisaged, what problems it solves, etc, etc.
> >
> > I can gladly update the changelog if the resource pool concept is
> > approved. I will fix all the below minor comments.
> >
> > I envision that this might be useful for all vendor's (IBM, AMD, Intel, etc) IOMMU driver
> > and for any kernel component which does lots of dynamic alloc/free an object of same size.
> >
>
> That's what kmem_cache_alloc() is for?!?!
We had this kmem_cache_alloc() with mempool concept earlier and Andi suggest to
come up with something pre-allocated pool.
Andi, Can you chime in please.

-Anil

2007-06-08 20:43:22

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

Am Fr 08.06.2007 21:01 schrieb Andrew Morton
<[email protected]>:

> On Fri, 8 Jun 2007 11:21:57 -0700
> "Keshavamurthy, Anil S" <[email protected]> wrote:
>
> > On Thu, Jun 07, 2007 at 04:27:26PM -0700, Andrew Morton wrote:
> > > On Wed, 06 Jun 2007 11:57:00 -0700
> > > [email protected] wrote:
> > >
> > > > Signed-off-by: Anil S Keshavamurthy
> > > > <[email protected]>
> > >
> > > That was a terse changelog.
> > >
> > > Obvious question: how does this differ from mempools, and would it
> > > be
> > > better to fill in any gaps in mempool functionality instead of
> > > implementing something similar-looking?
> >
> > Very good question. Mempool pre-allocates the elements
> > to the required minimum count size during its initilization time.
> > However when mempool_alloc() is called it tries to obtain the
> > element from OS and if that fails then it looks for the element in
> > its pool. If there are no elements in its pool and if the gpf_t
> > flags says it can wait then it waits untill someone puts the element
> > back to pool, else if gpf_t flag say it can;t wait then it returns
> > NULL.
> > In other words, mempool acts as *emergency* pool, i.e only if the OS
> > fails
> > to allocate the required memory, then the pool object is used.
> >
> >
> > In the IOMMU case, we need exactly opposite of what mempool
> > provides,
> > i.e we always want to look for the element in the pool and if the
> > pool
> > has no element then go to OS as a worst case. This resource pool
> > library routines do the same. Again, this resource pools
> > grows and shrinks automatically to maintain the minimum pool
> > elements in the background. I am not sure whether this totally
> > opposite functionality of mempools and resource pools can be
> > merged.
>
> Confused.
>
> If resource pools are not designed to provide extra robustness via an
> emergency pool, then what _are_ they designed for? (Boy this is a hard
> way
> to write a changelog!)

mempools are designed to manage a limited resource pool by sleeping
if necessary until someone else frees a resource. It's basically similar
how to main VM works with a sleeping allocation, just in a "private user
group"

In the IOMMU case sleeping is not allowed because pci_map_* typically
happens inside spinlocks. But the IOMMU code might need to allocate
new page tables and other datastructures in there.

This means mempools don't work for those (the previous version had non
sensical
constructs like GFP_ATOMIC mempool calls)

I haven't looked at Anil's code, but I suspect the only really robust
way to handle this case is to always preallocate everything. But I'm not
sure
why that would need new library functions; it should be just some simple
lists that could be open coded.

If it needs to fall back to the OS for any non pre allocation then it
will
likely be flakey under high load. Now that might be ok in some cases
-- apparently block layer is much better at handling this than it used
to be and networking has to handle it anyways, but it might be still
a unpleasant surprise for many drivers. One generic problem is that
there are no upcalls when such resources become avaialable again
so the upper layers would need to poll to know when to resubmit
a request.

It's a pretty messy problem unfortunately.

One relatively easy way out would be to just preallocate
a static aperture fully and always map into it. Not sure
how much memory that would need -- when it's too large
it might take a lot of memory for page tables always and when it's
too small it might overflow under high load.

> That's what kmem_cache_alloc() is for?!?!

Tradtionally that was not allowed in block layer path. Not sure
it is fully obsolete with the recent dirty tracking work, probably not.

Besides it would need to be GFP_ATOMIC and the default
atomic pools are not that big.

-Andi

2007-06-08 20:44:31

by Suresh Siddha

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, Jun 08, 2007 at 01:12:00PM -0700, Keshavamurthy, Anil S wrote:
> The resource pool indeed provide extra robustness, the initial pool size will
> be equal to min_count + grow_count. If the pool object count goes below
> min_count, then pool grows in the background while serving as emergency
> pool with min_count of objects in it. If we run out of emergency pool objects
> before the pool grow in the background, then we go to OS for allocation.
>
> Similary, if the pool objects grows above the max threshold,
> the objects are freed to OS in the background thread maintaining
> the pool objects close to min_count + grow_count size.

slab already has this and it has additional functionalities like reaping
over time, when there is no activity...

> We need several objects of size say( 4 * sizeof(u64)) and reuse
> them in dma map/unmap api calls for managing io virtual allocation address that
> this driver has dished out. Hence having pool of objects where we put
> the element in the linked list and and get it from the linked list is pretty
> fast compared to slab.

Not sure how is this fast compared to slab. Atleast slab is lockless in the
fast case..

> We had this kmem_cache_alloc() with mempool concept earlier and Andi suggest to
> come up with something pre-allocated pool.
> Andi, Can you chime in please.

In the initial patches, only for iova we were using slabs + mempool. But
for others like pgtable_mempool, we were using simple mempools.

Even slabs + mempool is not same as just usng slab.. slab is lockless
for the fast case.

thanks,
suresh

2007-06-08 20:44:47

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007 13:12:00 -0700
"Keshavamurthy, Anil S" <[email protected]> wrote:

> On Fri, Jun 08, 2007 at 12:01:07PM -0700, Andrew Morton wrote:
> > On Fri, 8 Jun 2007 11:21:57 -0700
> > "Keshavamurthy, Anil S" <[email protected]> wrote:
> >
> > > On Thu, Jun 07, 2007 at 04:27:26PM -0700, Andrew Morton wrote:
> > > > On Wed, 06 Jun 2007 11:57:00 -0700
> > > > [email protected] wrote:
> > > >
> > > > > Signed-off-by: Anil S Keshavamurthy <[email protected]>
> > > >
> > > > That was a terse changelog.
> > > >
> > > > Obvious question: how does this differ from mempools, and would it be
> > > > better to fill in any gaps in mempool functionality instead of
> > > > implementing something similar-looking?
> > >
> > > Very good question. Mempool pre-allocates the elements
> > > to the required minimum count size during its initilization time.
> > > However when mempool_alloc() is called it tries to obtain the
> > > element from OS and if that fails then it looks for the element in
> > > its pool. If there are no elements in its pool and if the gpf_t
> > > flags says it can wait then it waits untill someone puts the element
> > > back to pool, else if gpf_t flag say it can;t wait then it returns NULL.
> > > In other words, mempool acts as *emergency* pool, i.e only if the OS fails
> > > to allocate the required memory, then the pool object is used.
> > >
> > >
> > > In the IOMMU case, we need exactly opposite of what mempool provides,
> > > i.e we always want to look for the element in the pool and if the pool
> > > has no element then go to OS as a worst case. This resource pool
> > > library routines do the same. Again, this resource pools
> > > grows and shrinks automatically to maintain the minimum pool
> > > elements in the background. I am not sure whether this totally
> > > opposite functionality of mempools and resource pools can be
> > > merged.
> >
> > Confused.
> >
> > If resource pools are not designed to provide extra robustness via an
> > emergency pool, then what _are_ they designed for? (Boy this is a hard way
> > to write a changelog!)
>
> The resource pool indeed provide extra robustness, the initial pool size will
> be equal to min_count + grow_count. If the pool object count goes below
> min_count, then pool grows in the background while serving as emergency
> pool with min_count of objects in it. If we run out of emergency pool objects
> before the pool grow in the background, then we go to OS for allocation.

This wholly duplicates kswapd functionality.

> Similary, if the pool objects grows above the max threshold,
> the objects are freed to OS in the background thread maintaining
> the pool objects close to min_count + grow_count size.

That problem was _introduced_ by resource-pools, so yes, it also needs to
be solved there.

>
> >
> > > In fact the very first version of this IOMMU patch used mempools
> > > and the performance was worse because mempool did not help as
> > > IOMMU did a very frequent alloc and free of pool objects and
> > > every call to alloc/free used to go to os. Andi Kleen,
> > > noticied and told us that mempool usage for IOMMU is wrong and
> > > hence we came up with resource pool concept.
> >
> > You _seem_ to be saying that the resource pools are there purely for
> > alloc/free performance reasons. If so, I'd be skeptical: slab is pretty
> > darned fast.
> We need several objects of size say( 4 * sizeof(u64)) and reuse
> them in dma map/unmap api calls for managing io virtual allocation address that
> this driver has dished out. Hence having pool of objects where we put
> the element in the linked list and and get it from the linked list is pretty
> fast compared to slab.

slab is fast (and IO is slow!). Do you have benchmark results?

> >
> > > >
> > > > The changelog very much should describe all this, as well as explaining
> > > > what the dynamic behaviour of this new thing is, and what applications are
> > > > envisaged, what problems it solves, etc, etc.
> > >
> > > I can gladly update the changelog if the resource pool concept is
> > > approved. I will fix all the below minor comments.
> > >
> > > I envision that this might be useful for all vendor's (IBM, AMD, Intel, etc) IOMMU driver
> > > and for any kernel component which does lots of dynamic alloc/free an object of same size.
> > >
> >
> > That's what kmem_cache_alloc() is for?!?!
> We had this kmem_cache_alloc() with mempool concept earlier and Andi suggest to
> come up with something pre-allocated pool.
> Andi, Can you chime in please.

2007-06-08 20:56:23

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007 22:43:10 +0200 (CEST)
Andreas Kleen <[email protected]> wrote:

> > That's what kmem_cache_alloc() is for?!?!
>
> Tradtionally that was not allowed in block layer path. Not sure
> it is fully obsolete with the recent dirty tracking work, probably not.
>
> Besides it would need to be GFP_ATOMIC and the default
> atomic pools are not that big.

That in itself is a problem. What do we have to do to be able
to make these allocations use the *much* stronger GFP_NOIO?

We could perhaps talk to Christoph about arranging for each slabcache to
have an optional private reserve page pool. But fixing the GFP_ATOMIC
would be better.

2007-06-08 21:24:48

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, Jun 08, 2007 at 10:43:10PM +0200, Andreas Kleen wrote:
> Am Fr 08.06.2007 21:01 schrieb Andrew Morton
> <[email protected]>:
>
> > On Fri, 8 Jun 2007 11:21:57 -0700
> > "Keshavamurthy, Anil S" <[email protected]> wrote:
> >
> > > On Thu, Jun 07, 2007 at 04:27:26PM -0700, Andrew Morton wrote:
> > > > On Wed, 06 Jun 2007 11:57:00 -0700
> > > > [email protected] wrote:
> > > >
> > > > > Signed-off-by: Anil S Keshavamurthy
> > > > > <[email protected]>
> > > >
> > > > That was a terse changelog.
> > > >
> > > > Obvious question: how does this differ from mempools, and would it
> > > > be
> > > > better to fill in any gaps in mempool functionality instead of
> > > > implementing something similar-looking?
> > >
> > > Very good question. Mempool pre-allocates the elements
> > > to the required minimum count size during its initilization time.
> > > However when mempool_alloc() is called it tries to obtain the
> > > element from OS and if that fails then it looks for the element in
> > > its pool. If there are no elements in its pool and if the gpf_t
> > > flags says it can wait then it waits untill someone puts the element
> > > back to pool, else if gpf_t flag say it can;t wait then it returns
> > > NULL.
> > > In other words, mempool acts as *emergency* pool, i.e only if the OS
> > > fails
> > > to allocate the required memory, then the pool object is used.
> > >
> > >
> > > In the IOMMU case, we need exactly opposite of what mempool
> > > provides,
> > > i.e we always want to look for the element in the pool and if the
> > > pool
> > > has no element then go to OS as a worst case. This resource pool
> > > library routines do the same. Again, this resource pools
> > > grows and shrinks automatically to maintain the minimum pool
> > > elements in the background. I am not sure whether this totally
> > > opposite functionality of mempools and resource pools can be
> > > merged.
> >
> > Confused.
> >
> > If resource pools are not designed to provide extra robustness via an
> > emergency pool, then what _are_ they designed for? (Boy this is a hard
> > way
> > to write a changelog!)
>
> mempools are designed to manage a limited resource pool by sleeping
> if necessary until someone else frees a resource. It's basically similar
> how to main VM works with a sleeping allocation, just in a "private user
> group"
>
> In the IOMMU case sleeping is not allowed because pci_map_* typically
> happens inside spinlocks.? But the IOMMU code might need to allocate
> new page tables and other datastructures in there.
>
> This means mempools don't work for those (the previous version had non
> sensical
> constructs like GFP_ATOMIC mempool calls)
>
> ?I haven't looked at Anil's code, but I suspect the only really robust
> way to handle this case is to always preallocate everything. But I'm not
> sure
> why that would need new library functions; it should be just some simple
> lists that could be open coded.

Since it is practically impossible to predicit how much to preallocate,
we have a min_count+grow_count of object allocated and we always use from this
pool. If the object count goes below certain low threshold(which acts as
emergency pool from this point), the pool grows by allocating and
adding the newly allocated object into the pool in the
worker (keventd) thread.
Again, once the IO pressure is over, the PCI driver
does the unmap calls and we put back the objects back to preallocate pools.
The smartness is builtin to the pool as the elements are put back to the
pool it detects that the pool count is greater then the threshold and
it automagically queues the work to free the objects and bring back the
pre-allocated object count back to minimum threshold.
Thus this preallocated pool grow and shrinks based on the demand, while
acting as both pre-allocated pools and as emergency pool.

Currently I have made this as a libray functions, if that is not correct,
we can pull this and make it part of the Intel IOMMU driver itself.
Please do let me know your suggestions.

>
> If it needs to fall back to the OS for any non pre allocation then it
> will
> likely be flakey under high load. Now that might be ok in some cases
> -- apparently block layer is much better at handling this than it used
> to be and networking has to handle it anyways, but it might be still
> a unpleasant surprise for many drivers. One generic problem is that
> there are no upcalls when such resources become avaialable again
> so the upper layers would need to poll to know when to resubmit
> a request.
>
> It's a pretty messy problem unfortunately.
I agree, worst case if the element is not available in the
pool we need to fall back to the OS and if OS fails then it
is tough luck.

>
> One relatively easy way out would be to just preallocate
> a static aperture fully and always map into it. Not sure
> how much memory that would need -- when it's too large
> it might take a lot of memory for page tables always and when it's
> too small it might overflow under high load.
Yup, and since we need to use the same driver from
desktop to servers, preallocation count for servers
many not be suitable for desktops.

>
> > That's what kmem_cache_alloc() is for?!?!
>
> Tradtionally that was not allowed in block layer path. Not sure
> it is fully obsolete with the recent dirty tracking work, probably not.
>
> Besides it would need to be GFP_ATOMIC and the default
> atomic pools are not that big.
>
> -Andi

2007-06-08 21:42:32

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007 14:20:54 -0700
"Keshavamurthy, Anil S" <[email protected]> wrote:

> > This means mempools don't work for those (the previous version had non
> > sensical
> > constructs like GFP_ATOMIC mempool calls)
> >
> > __I haven't looked at Anil's code, but I suspect the only really robust
> > way to handle this case is to always preallocate everything. But I'm not
> > sure
> > why that would need new library functions; it should be just some simple
> > lists that could be open coded.
>
> Since it is practically impossible to predicit how much to preallocate,
> we have a min_count+grow_count of object allocated and we always use from this
> pool. If the object count goes below certain low threshold(which acts as
> emergency pool from this point), the pool grows by allocating and
> adding the newly allocated object into the pool in the
> worker (keventd) thread.

Asking keventd to do this might be problematic: there may be code in various
dark corners of device drivers which also depend upon keventd services for
IO completion, in which case there might be obscure deadlocks, dunno.
otoh, keventd already surely does GFP_KERNEL allocations...

But still, the whole thing seems pointless: kswapd is already doing all of
this, replenishing the page reserves. So why not use that?

> Again, once the IO pressure is over, the PCI driver
> does the unmap calls and we put back the objects back to preallocate pools.
> The smartness is builtin to the pool as the elements are put back to the
> pool it detects that the pool count is greater then the threshold and
> it automagically queues the work to free the objects and bring back the
> pre-allocated object count back to minimum threshold.
> Thus this preallocated pool grow and shrinks based on the demand, while
> acting as both pre-allocated pools and as emergency pool.
>
> Currently I have made this as a libray functions, if that is not correct,
> we can pull this and make it part of the Intel IOMMU driver itself.
> Please do let me know your suggestions.

I'd say just remove the whole thing and use kmem_cache_alloc().

Put much effort into removing the GFP_ATOMIC and using GFP_NOIO instead:
there's your problem right there.

If for some reason you really can't do that (and a requirement for
allocation-in-interrupt is the only valid reason, really) and if you indeed
can demonstrate memory allocation failures with certain workloads then
let's take a look at that. As I said, attaching a reserve pool to your
slab cache might be a suitable approach. But none of these things are
magic: if memory allcoation failures or deadlocks or livelocks are
demonstrable with the reserves absent, then they'll also be possible with
the reserves present.

Unless you use mempools, and can sleep.

2007-06-08 22:18:39

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

Andrew Morton wrote:
> Put much effort into removing the GFP_ATOMIC and using GFP_NOIO instead:
> there's your problem right there.
>
> If for some reason you really can't do that (and a requirement for
> allocation-in-interrupt is the only valid reason, really)

and that's the case here; IO gets submitted from IRQ handlers (both
network and block).......

2007-06-08 22:21:59

by Suresh Siddha

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, Jun 08, 2007 at 02:42:07PM -0700, Andrew Morton wrote:
> I'd say just remove the whole thing and use kmem_cache_alloc().

We will try that.

> Put much effort into removing the GFP_ATOMIC and using GFP_NOIO instead:
> there's your problem right there.

As these are called from interrupt handlers, we can't use GFP_NOIO.

> If for some reason you really can't do that (and a requirement for
> allocation-in-interrupt is the only valid reason, really) and if you indeed
> can demonstrate memory allocation failures with certain workloads then
> let's take a look at that. As I said, attaching a reserve pool to your
> slab cache might be a suitable approach. But none of these things are

I agree. We are better off with enhancing slab infrastructure for this, if
needed.

> magic: if memory allcoation failures or deadlocks or livelocks are
> demonstrable with the reserves absent, then they'll also be possible with
> the reserves present.
>
> Unless you use mempools, and can sleep.

2007-06-08 22:32:07

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Friday 08 June 2007 22:55, Andrew Morton wrote:
> On Fri, 8 Jun 2007 22:43:10 +0200 (CEST)
>
> Andreas Kleen <[email protected]> wrote:
> > > That's what kmem_cache_alloc() is for?!?!
> >
> > Tradtionally that was not allowed in block layer path. Not sure
> > it is fully obsolete with the recent dirty tracking work, probably not.
> >
> > Besides it would need to be GFP_ATOMIC and the default
> > atomic pools are not that big.
>
> That in itself is a problem. What do we have to do to be able
> to make these allocations use the *much* stronger GFP_NOIO?

That still sleeps.

Allow networking and block drivers (and other device drivers) to
map SGs without holding spinlocks.

Would be major work. I've been asking for it for years
because it would also avoid other nasty problem, like the panic-on-overflow
issues with swiotlb/AMD GART iommu -- it could just block for free
space instead.

-Andi

2007-06-08 22:32:37

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007, Keshavamurthy, Anil S wrote:

> In the IOMMU case, we need exactly opposite of what mempool provides,
> i.e we always want to look for the element in the pool and if the pool
> has no element then go to OS as a worst case. This resource pool
> library routines do the same. Again, this resource pools
> grows and shrinks automatically to maintain the minimum pool
> elements in the background. I am not sure whether this totally
> opposite functionality of mempools and resource pools can be
> merged.

What functionality are you missing in the page allocator? It seems that is
does what you want?

2007-06-08 22:34:40

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007, Keshavamurthy, Anil S wrote:

> > You _seem_ to be saying that the resource pools are there purely for
> > alloc/free performance reasons. If so, I'd be skeptical: slab is pretty
> > darned fast.
> We need several objects of size say( 4 * sizeof(u64)) and reuse
> them in dma map/unmap api calls for managing io virtual allocation address that
> this driver has dished out. Hence having pool of objects where we put
> the element in the linked list and and get it from the linked list is pretty
> fast compared to slab.

SLUB also manages objects using a linked list. Is there a real performance
difference?

2007-06-08 22:36:18

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007, Andreas Kleen wrote:

> > That's what kmem_cache_alloc() is for?!?!
>
> Tradtionally that was not allowed in block layer path. Not sure
> it is fully obsolete with the recent dirty tracking work, probably not.

Why was it not allowed? Because interrupts are disabled?

> Besides it would need to be GFP_ATOMIC and the default
> atomic pools are not that big.

Those could be increased. I think Mel has them already increased in mm.

2007-06-08 22:38:17

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007, Siddha, Suresh B wrote:

> > If for some reason you really can't do that (and a requirement for
> > allocation-in-interrupt is the only valid reason, really) and if you indeed
> > can demonstrate memory allocation failures with certain workloads then
> > let's take a look at that. As I said, attaching a reserve pool to your
> > slab cache might be a suitable approach. But none of these things are
>
> I agree. We are better off with enhancing slab infrastructure for this, if
> needed.

The slab allocators already use the page allocators atomic reserves if
called with GFP_ATOMIC.

2007-06-08 22:49:34

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, Jun 08, 2007 at 03:32:08PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Keshavamurthy, Anil S wrote:
>
> > In the IOMMU case, we need exactly opposite of what mempool provides,
> > i.e we always want to look for the element in the pool and if the pool
> > has no element then go to OS as a worst case. This resource pool
> > library routines do the same. Again, this resource pools
> > grows and shrinks automatically to maintain the minimum pool
> > elements in the background. I am not sure whether this totally
> > opposite functionality of mempools and resource pools can be
> > merged.
>
> What functionality are you missing in the page allocator? It seems that is
> does what you want?
Humm..I basically want to allocate memory during interrupt context and
expect not to fail. I know this is a hard requirement :)
I want to be able to reserve certain amount of memory specifically for
IOMMU purpose.

-Anil

2007-06-08 22:53:26

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, Jun 08, 2007 at 03:33:39PM -0700, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Keshavamurthy, Anil S wrote:
>
> > > You _seem_ to be saying that the resource pools are there purely for
> > > alloc/free performance reasons. If so, I'd be skeptical: slab is pretty
> > > darned fast.
> > We need several objects of size say( 4 * sizeof(u64)) and reuse
> > them in dma map/unmap api calls for managing io virtual allocation address that
> > this driver has dished out. Hence having pool of objects where we put
> > the element in the linked list and and get it from the linked list is pretty
> > fast compared to slab.
>
> SLUB also manages objects using a linked list. Is there a real performance
> difference?

Sorry, I have not tried using SLUB, I will surely check this out.

-Anil

2007-06-08 22:55:20

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Fri, 8 Jun 2007, Keshavamurthy, Anil S wrote:

> > What functionality are you missing in the page allocator? It seems that is
> > does what you want?
> Humm..I basically want to allocate memory during interrupt context and
> expect not to fail. I know this is a hard requirement :)

The page allocator can do that for you. The reserves are configurable. Not
failing is a thing unseen in the computer world.

> I want to be able to reserve certain amount of memory specifically for
> IOMMU purpose.

That is of course a problem unless you allocate memory beforehand.
Mempool?

2007-06-08 22:57:24

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Saturday 09 June 2007 00:36, Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Andreas Kleen wrote:
> > > That's what kmem_cache_alloc() is for?!?!
> >
> > Tradtionally that was not allowed in block layer path. Not sure
> > it is fully obsolete with the recent dirty tracking work, probably not.
>
> Why was it not allowed? Because interrupts are disabled?

Allocating memory during page out under low memory could
lead to deadlocks. That is because Linux used to make no attempt
to limit dirty pages for anonymous mappings and then you could
end up with most of your memory dirty and not enough
memory cleanable for page out and then when page out
needs more memory you could be dead.

[yes that implies that mmap over NFS was always broken]

Now there is a anon dirty limit since a few releases, but I'm not
fully convinced it solves the problem completely.

> > Besides it would need to be GFP_ATOMIC and the default
> > atomic pools are not that big.
>
> Those could be increased. I think Mel has them already increased in mm.

That would be lots of wasted memory. I already got >100 MB free
on my workstation under steady caching state. Already far too much imho.

-Andi

2007-06-08 22:59:40

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Sat, 9 Jun 2007, Andi Kleen wrote:

> > Why was it not allowed? Because interrupts are disabled?
>
> Allocating memory during page out under low memory could
> lead to deadlocks. That is because Linux used to make no attempt
> to limit dirty pages for anonymous mappings and then you could
> end up with most of your memory dirty and not enough
> memory cleanable for page out and then when page out
> needs more memory you could be dead.
>
> [yes that implies that mmap over NFS was always broken]

Right. We got that fixed in 2.6.19.

> Now there is a anon dirty limit since a few releases, but I'm not
> fully convinced it solves the problem completely.

A gut feeling or is there more?

2007-06-09 10:09:55

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

> > Now there is a anon dirty limit since a few releases, but I'm not
> > fully convinced it solves the problem completely.
>
> A gut feeling or is there more?

Lots of other subsystem can allocate a lot of memory
and they usually don't cooperate and have similar dirty limit concepts.
So you could run out of usable memory anyways and then have a similar
issue.

For example a flood of network packets could always steal your
GFP_ATOMIC pools very quickly in the background (gigabit or 10gig
can transfer a lot of data very quickly)

Also iirc try_to_free_pages() is not completely fair and might fail
under extreme load for some requesters.

Not requiring memory allocation for any IO would be certainly safer.

Anyways, it's a theoretic question because you can't sleep in
there anyways unless something drastic changes in the driver interfaces.

-Andi

2007-06-10 16:40:29

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

Christoph Lameter wrote:
> On Fri, 8 Jun 2007, Keshavamurthy, Anil S wrote:
>
>>> What functionality are you missing in the page allocator? It seems that is
>>> does what you want?
>> Humm..I basically want to allocate memory during interrupt context and
>> expect not to fail. I know this is a hard requirement :)
>
> The page allocator can do that for you. The reserves are configurable. Not
> failing is a thing unseen in the computer world.

but the page allocator reserve is shared.. and you will need this one
EXACTLY when the shared pool is getting low... it's not an
uncorrelated thing!

2007-06-11 16:10:26

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Sun, 10 Jun 2007, Arjan van de Ven wrote:

> Christoph Lameter wrote:
> > On Fri, 8 Jun 2007, Keshavamurthy, Anil S wrote:
> >
> > > > What functionality are you missing in the page allocator? It seems that
> > > > is does what you want?
> > > Humm..I basically want to allocate memory during interrupt context and
> > > expect not to fail. I know this is a hard requirement :)
> >
> > The page allocator can do that for you. The reserves are configurable. Not
> > failing is a thing unseen in the computer world.
>
> but the page allocator reserve is shared.. and you will need this one EXACTLY
> when the shared pool is getting low... it's not an uncorrelated thing!

I think what needs to be done first is to define under what conditions the
allocation should not fail. There is certainly no way to have GFP_ATOMIC
allocation that never fail. After all memory is limited and at some point
we need to reclaim memory.

2007-06-11 20:48:45

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Sat, Jun 09, 2007 at 11:47:23AM +0200, Andi Kleen wrote:
>
> > > Now there is a anon dirty limit since a few releases, but I'm not
> > > fully convinced it solves the problem completely.
> >
> > A gut feeling or is there more?
>
> Lots of other subsystem can allocate a lot of memory
> and they usually don't cooperate and have similar dirty limit concepts.
> So you could run out of usable memory anyways and then have a similar
> issue.
>
> For example a flood of network packets could always steal your
> GFP_ATOMIC pools very quickly in the background (gigabit or 10gig
> can transfer a lot of data very quickly) >
> Also iirc try_to_free_pages() is not completely fair and might fail
> under extreme load for some requesters.
>
> Not requiring memory allocation for any IO would be certainly safer.
>
> Anyways, it's a theoretic question because you can't sleep in
> there anyways unless something drastic changes in the driver interfaces.

Agree, that the ideal thing would be to make such changes in the driver
interfaces where in dma_map_{singe|sg} API's are not called in the
interrupt context and/or spinlock held, there by IOMMU drivers are
free to block when memory is not available. This seems to be a
noble goal invloving huge changes and testing beyond the scope of the
current IOMMU driver. I guess it would be ideal if this gets discussed
and resolved at kernel summit.

Assuming that we may have to live with the above limitations for a
while, what is the best way to allocate memory in the
dma_map_{single|sg} API's for the IOMMU drivers? (these memory
are required to setup internal IOMMU pagetables etc.)

In the first implementation of ours, we had used mempools api's to
allocate memory and we were told that mempools with GFP_ATOMIC is
useless and hence in the second implementation we came up with
resource pools ( which is preallocate pools) and again as I understand
the argument is why create another when we have slab allocation which
is similar to this resource pools.

Hence, can I assume that the conclusion of this
discussion is to use kmem_cache_alloc() functions
to allocate memory in dma_map_{single|sg} API's?

Again, if dma_map_{single|sg} API's fails due to
failure to allocate memory, the only thing that can
be done is to panic as this is what few of the other
IOMMU implementation is doing today.

Please advice.

Thanks,
Anil

2007-06-11 21:15:20

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, 11 Jun 2007 13:44:42 -0700
"Keshavamurthy, Anil S" <[email protected]> wrote:

> In the first implementation of ours, we had used mempools api's to
> allocate memory and we were told that mempools with GFP_ATOMIC is
> useless and hence in the second implementation we came up with
> resource pools ( which is preallocate pools) and again as I understand
> the argument is why create another when we have slab allocation which
> is similar to this resource pools.

Odd. mempool with GFP_ATOMIC is basically equivalent to your
resource-pools, isn't it?: we'll try the slab allocator and if that failed,
fall back to the reserves.

It's missing the recharge-from-a-kernel-thread functionality but that can be
added easily enough if it's useful. It's slightly abusive of the mempool
philosophy, but it's probably better to do that than to create a new and
very-similar thing.

> Hence, can I assume that the conclusion of this
> discussion is to use kmem_cache_alloc() functions
> to allocate memory in dma_map_{single|sg} API's?
>
> Again, if dma_map_{single|sg} API's fails due to
> failure to allocate memory, the only thing that can
> be done is to panic as this is what few of the other
> IOMMU implementation is doing today.

If the only option is to panic then something's busted. If it's network IO
then there should be a way of dropping the frame. If it's disk IO then we
should report the failure and cause an IO error.

2007-06-11 21:30:12

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, 11 Jun 2007, Keshavamurthy, Anil S wrote:

> Hence, can I assume that the conclusion of this
> discussion is to use kmem_cache_alloc() functions
> to allocate memory in dma_map_{single|sg} API's?

Use the page allocator for page size allocations. If you need to have
specially aligned memory in less than page sized chunks then use
kmem_cache_alloc with a specially configured slab.
>
> Again, if dma_map_{single|sg} API's fails due to
> failure to allocate memory, the only thing that can
> be done is to panic as this is what few of the other
> IOMMU implementation is doing today.

Why does it have to be so severe? The I/O operation fails right?

2007-06-11 21:43:32

by Ashok Raj

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, Jun 11, 2007 at 02:14:49PM -0700, Andrew Morton wrote:
> >
> > Again, if dma_map_{single|sg} API's fails due to
> > failure to allocate memory, the only thing that can
> > be done is to panic as this is what few of the other
> > IOMMU implementation is doing today.
>
> If the only option is to panic then something's busted. If it's network IO
> then there should be a way of dropping the frame. If it's disk IO then we
> should report the failure and cause an IO error.

Just looking at the code.. appears that quite a few popular ones (or should say most) dont even look at the dma_addr_t returned to check for failure.

Thats going to be another major cleanup work.

2007-06-11 21:44:17

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, Jun 11, 2007 at 02:29:56PM -0700, Christoph Lameter wrote:
> On Mon, 11 Jun 2007, Keshavamurthy, Anil S wrote:
>
> > Hence, can I assume that the conclusion of this
> > discussion is to use kmem_cache_alloc() functions
> > to allocate memory in dma_map_{single|sg} API's?
>
>
> Use the page allocator for page size allocations. If you need to have
> specially aligned memory in less than page sized chunks then use
> kmem_cache_alloc with a specially configured slab.

Okay, will do this change and get back to the list.

> >
> > Again, if dma_map_{single|sg} API's fails due to
> > failure to allocate memory, the only thing that can
> > be done is to panic as this is what few of the other
> > IOMMU implementation is doing today.
>
> Why does it have to be so severe? The I/O operation fails right?
Not sure. Also most of the callers today don;t check for failures.

-Anil

2007-06-11 22:26:19

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

>
> If the only option is to panic then something's busted. If it's network IO
> then there should be a way of dropping the frame. If it's disk IO then we
> should report the failure and cause an IO error.

An block IO error is basically catastrophic for the system too. There isn't really
a concept of "temporary IO error that will resolve itself" concept in Unix.

There are still lots of users of pci_map_single() that don't check the return
value unfortunately. That is mostly in old drivers; it is generally
picked on in reviews now. But then there is no guarantee that these rarely
used likely untested error handling paths actually work.

The alternative is writing out random junk which is somewhat risky.

We fixed over time all the pci_map_sg()s at least to do the checks correctly.

When I wrote the IOMMU code originally this wasn't the case and it destroyed
several file systems of test systems due to IOMMU leaks in drivers
(writing junk over the super block when the IOMMU is full doesn't make
mount happy after the next reboot)

Because of these experiences I'm more inclined towards the panic option,
although x86-64 defaults to not panic these days.

It would be really much better if sleeping was allowed, but it is hard.

-Andi

2007-06-11 22:26:32

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

> Please advice.

I think the short term only safe option would be to fully preallocate an aperture.
If it is too small you can try GFP_ATOMIC but it would be just
a unreliable fallback. For safety you could perhaps have some kernel thread
that tries to enlarge it in the background depending on current
use. That would be not 100% guaranteed to keep up with load,
but would at least keep up if the system is not too busy.

That is basically what your resource pools do, but they seem
to be unnecessarily convoluted for the task :- after all you
could just preallocate the page tables and rewrite/flush them without
having some kind of allocator inbetween, can't you?

If you make the start value large enough (256+MB?) that might reasonably
work. How much memory in page tables would that take? Or perhaps scale
it with available memory or available devices.

In theory it could also be precomputed from the block/network device queue
lengths etc.; the trouble is just such checks would need to be added to all kinds of
other odd subsystems that manage devices too. That would be much more work.

Some investigation how to do sleeping block/network submit would be
also interesting (e.g. replace the spinlocks there with mutexes and see how
much it affects performance). For networking you would need to keep
at least a non sleeping path though because packets can be legally
submitted from interrupt context. If it works out then sleeping
interfaces to the IOMMU code could be added.

-Andi

2007-06-11 23:19:10

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Tue, Jun 12, 2007 at 12:25:57AM +0200, Andi Kleen wrote:
>
> > Please advice.
>
> I think the short term only safe option would be to fully preallocate an aperture.
> If it is too small you can try GFP_ATOMIC but it would be just
> a unreliable fallback. For safety you could perhaps have some kernel thread
> that tries to enlarge it in the background depending on current
> use. That would be not 100% guaranteed to keep up with load,
> but would at least keep up if the system is not too busy.
>
> That is basically what your resource pools do, but they seem
> to be unnecessarily convoluted for the task :- after all you
> could just preallocate the page tables and rewrite/flush them without
> having some kind of allocator inbetween, can't you?
Nope, it is not convoluted for the task. If you see carefully how
the IO virtual address is obtained, I am basically reusing the
the previous translated virutal address once it is freed instead
of going for continious IO virtural address. Because of this
reuse of IO virtual address, these address tend to map to the
same PAGE tables and hence the memory for page tables itself
does not grow unless there is that much IO going on in the System
where all entries in the page tables are full(which means that
much IO is in flight).

The only defect I see in the current resource pool is that
I am queuing the work on Keventd to grow the pool which could
be a problem as many other subsystem in the kernel depends
on keventd and as Anderew pointed out we could dead lock.
If we have a separate worker thread to grow the pool then
the deadlock issues is taken care.

I would still love to get my current resource pool (just fix the
keventd to separate thread to grow the pool) implementations
to get into linus kernrl compared to kmem_cache_alloc implementation
as I don;t see any benifit in moving to kmem_cache_alloc. But if
people want I can provide kmem_cache_alloc implementation
too just for comparisions. But this does not solve the fundamental
problem that we have today.

So ideally for IOMMU's we should have some preallocated buffers
and if the buffers reach certain min_threshould the pool should
grow in the background and all of these features is in resource pool
implementation. Since we did not see any problems, can we
atleat try this resource pool implementation in the linux
MM kernels? If it is too bad, then I will change to
kmem_cache_alloc() version. If this testing is OKAY, then
I will refresh my patch for the coding styles etc and
resubmit with resource pool implementation. Andrew??

>
> If you make the start value large enough (256+MB?) that might reasonably
> work. How much memory in page tables would that take? Or perhaps scale
> it with available memory or available devices.

What you are suggesting is to prealloacate and setup the page tables at the
begining. But this would waste lot of memory because we don't know ahead of
time how large the page table setup should be and in future our hardware
can support 64K domains where each domain can dish out independent address
from its start to end address range. And pre-setup of tables for all of the
64K domains is not feasible.

>
> In theory it could also be precomputed from the block/network device queue
> lengths etc.; the trouble is just such checks would need to be added to all kinds of
> other odd subsystems that manage devices too. That would be much more work.
>
> Some investigation how to do sleeping block/network submit would be
> also interesting (e.g. replace the spinlocks there with mutexes and see how
> much it affects performance). For networking you would need to keep
> at least a non sleeping path though because packets can be legally
> submitted from interrupt context. If it works out then sleeping
> interfaces to the IOMMU code could be added.

Yup, these investigations needs to happen and sooner the better for all and
for general linux community.

>
> -Andi

2007-06-11 23:26:42

by Ashok Raj

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

2007-06-11 23:29:11

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Tue, 12 Jun 2007, Andi Kleen wrote:

> > If the only option is to panic then something's busted. If it's network IO
> > then there should be a way of dropping the frame. If it's disk IO then we
> > should report the failure and cause an IO error.
>
> An block IO error is basically catastrophic for the system too. There isn't really
> a concept of "temporary IO error that will resolve itself" concept in Unix.

In Unix? You mean the block layer cannot handle a I/O error? Not too
familiar with it but from what I can tell an I/O operation can be aborted
in the request function.

> There are still lots of users of pci_map_single() that don't check the return
> value unfortunately. That is mostly in old drivers; it is generally
> picked on in reviews now. But then there is no guarantee that these rarely
> used likely untested error handling paths actually work.

Then we need to review the code and clean these up.

2007-06-11 23:56:20

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, Jun 11, 2007 at 02:14:49PM -0700, Andrew Morton wrote:
> On Mon, 11 Jun 2007 13:44:42 -0700
> "Keshavamurthy, Anil S" <[email protected]> wrote:
>
> > In the first implementation of ours, we had used mempools api's to
> > allocate memory and we were told that mempools with GFP_ATOMIC is
> > useless and hence in the second implementation we came up with
> > resource pools ( which is preallocate pools) and again as I understand
> > the argument is why create another when we have slab allocation which
> > is similar to this resource pools.
>
> Odd. mempool with GFP_ATOMIC is basically equivalent to your
> resource-pools, isn't it?: we'll try the slab allocator and if that failed,
> fall back to the reserves.

slab allocators don;t reserve the memory, in other words this memory
can be consumed by VM under memory pressure which we don;t want in
IOMMU case.

Nope,they both are exactly opposite.
mempool with GFP_ATOMIC, first tries to get memory from OS and
if that fails, it looks for the object in the pool and returns.

Where as resource pool is exactly opposite of mempool, where each
time it looks for an object in the pool and if it exist then we
return that object else we try to get the memory for OS while
scheduling the work to grow the pool objects. In fact, the work
is schedule to grow the pool when the low threshold point is hit.

Hence, I still feel resource pool implementation is best choice
for IOMMU.

-Anil

2007-06-12 00:30:28

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, 11 Jun 2007 16:52:08 -0700 "Keshavamurthy, Anil S" <[email protected]> wrote:

> On Mon, Jun 11, 2007 at 02:14:49PM -0700, Andrew Morton wrote:
> > On Mon, 11 Jun 2007 13:44:42 -0700
> > "Keshavamurthy, Anil S" <[email protected]> wrote:
> >
> > > In the first implementation of ours, we had used mempools api's to
> > > allocate memory and we were told that mempools with GFP_ATOMIC is
> > > useless and hence in the second implementation we came up with
> > > resource pools ( which is preallocate pools) and again as I understand
> > > the argument is why create another when we have slab allocation which
> > > is similar to this resource pools.
> >
> > Odd. mempool with GFP_ATOMIC is basically equivalent to your
> > resource-pools, isn't it?: we'll try the slab allocator and if that failed,
> > fall back to the reserves.
>
> slab allocators don;t reserve the memory, in other words this memory
> can be consumed by VM under memory pressure which we don;t want in
> IOMMU case.
>
> Nope,they both are exactly opposite.
> mempool with GFP_ATOMIC, first tries to get memory from OS and
> if that fails, it looks for the object in the pool and returns.
>
> Where as resource pool is exactly opposite of mempool, where each
> time it looks for an object in the pool and if it exist then we
> return that object else we try to get the memory for OS while
> scheduling the work to grow the pool objects. In fact, the work
> is schedule to grow the pool when the low threshold point is hit.

I realise all that. But I'd have thought that the mempool approach is
actually better: use the page allocator and only deplete your reserve pool
when the page allocator fails.

The refill-the-pool-in-the-background feature sounds pretty worthless to
me. On a uniprocessor machine (for example), the kernel thread may not get
scheduled for tens of milliseconds (easily), which is far, far more than is
needed for that reserve pool to become fully consumed.

2007-06-12 00:39:14

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, 11 Jun 2007, Keshavamurthy, Anil S wrote:

> slab allocators don;t reserve the memory, in other words this memory
> can be consumed by VM under memory pressure which we don;t want in
> IOMMU case.

So mempools....

> Nope,they both are exactly opposite.
> mempool with GFP_ATOMIC, first tries to get memory from OS and
> if that fails, it looks for the object in the pool and returns.

How does the difference matter? In both cases you get the memory you want.

> Where as resource pool is exactly opposite of mempool, where each
> time it looks for an object in the pool and if it exist then we
> return that object else we try to get the memory for OS while
> scheduling the work to grow the pool objects. In fact, the work
> is schedule to grow the pool when the low threshold point is hit.

Grow the mempool when the low level point is hit? Or equip mempools with
the functionality that you want?

2007-06-12 01:12:38

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

Andrew Morton wrote:
>> Where as resource pool is exactly opposite of mempool, where each
>> time it looks for an object in the pool and if it exist then we
>> return that object else we try to get the memory for OS while
>> scheduling the work to grow the pool objects. In fact, the work
>> is schedule to grow the pool when the low threshold point is hit.
>
> I realise all that. But I'd have thought that the mempool approach is
> actually better: use the page allocator and only deplete your reserve pool
> when the page allocator fails.

the problem with that is that if anything downstream from the iommu
layer ALSO needs memory, we've now eaten up the last free page and
things go splat.

in terms of deadlock avoidance... I wonder if we need something
similar to the swap token; once a process dips into the emergency
pool, it becomes the only one that gets to use this pool, so that it's
entire chain of allocations will succeed, rather than each process
only getting halfway through...

But yeah it's minute details and you can argue either way is the right
approach.

You can even argue for the old highmem.c approach; go into half the
pool before going to the VM, then to kmalloc() and if that fails dip
into the second half of the pool.

2007-06-12 01:30:50

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, 11 Jun 2007, Arjan van de Ven wrote:

> the problem with that is that if anything downstream from the iommu layer ALSO
> needs memory, we've now eaten up the last free page and things go splat.

Hmmm... We need something like a reservation system right? Higher levels
in a atomic context could register their future needs. Then we can avoid
overallocating in the iommu layer.

2007-06-12 01:36:22

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, 11 Jun 2007 18:10:40 -0700 Arjan van de Ven <[email protected]> wrote:

> Andrew Morton wrote:
> >> Where as resource pool is exactly opposite of mempool, where each
> >> time it looks for an object in the pool and if it exist then we
> >> return that object else we try to get the memory for OS while
> >> scheduling the work to grow the pool objects. In fact, the work
> >> is schedule to grow the pool when the low threshold point is hit.
> >
> > I realise all that. But I'd have thought that the mempool approach is
> > actually better: use the page allocator and only deplete your reserve pool
> > when the page allocator fails.
>
> the problem with that is that if anything downstream from the iommu
> layer ALSO needs memory, we've now eaten up the last free page and
> things go splat.

If that happens, we still have the mempool reserve to fall back to.

I don't see why it is better to consume the reserves before going to the
page allocator instead of holding them, err, in reserve.

2007-06-12 01:57:31

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

Andrew Morton wrote:
> On Mon, 11 Jun 2007 18:10:40 -0700 Arjan van de Ven <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>>> Where as resource pool is exactly opposite of mempool, where each
>>>> time it looks for an object in the pool and if it exist then we
>>>> return that object else we try to get the memory for OS while
>>>> scheduling the work to grow the pool objects. In fact, the work
>>>> is schedule to grow the pool when the low threshold point is hit.
>>> I realise all that. But I'd have thought that the mempool approach is
>>> actually better: use the page allocator and only deplete your reserve pool
>>> when the page allocator fails.
>> the problem with that is that if anything downstream from the iommu
>> layer ALSO needs memory, we've now eaten up the last free page and
>> things go splat.
>
> If that happens, we still have the mempool reserve to fall back to.

we do, except that we just ate the memory the downstream code would
use and get ... so THAT can't get any.

2007-06-12 02:12:16

by Suresh Siddha

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, Jun 11, 2007 at 06:55:46PM -0700, Arjan van de Ven wrote:
> Andrew Morton wrote:
> >On Mon, 11 Jun 2007 18:10:40 -0700 Arjan van de Ven
> ><[email protected]> wrote:
> >
> >>Andrew Morton wrote:
> >>>>Where as resource pool is exactly opposite of mempool, where each
> >>>>time it looks for an object in the pool and if it exist then we
> >>>>return that object else we try to get the memory for OS while
> >>>>scheduling the work to grow the pool objects. In fact, the work
> >>>>is schedule to grow the pool when the low threshold point is hit.
> >>>I realise all that. But I'd have thought that the mempool approach is
> >>>actually better: use the page allocator and only deplete your reserve
> >>>pool
> >>>when the page allocator fails.
> >>the problem with that is that if anything downstream from the iommu
> >>layer ALSO needs memory, we've now eaten up the last free page and
> >>things go splat.
> >
> >If that happens, we still have the mempool reserve to fall back to.
>
> we do, except that we just ate the memory the downstream code would
> use and get ... so THAT can't get any.

Then this problem can happen, irrespective of the changes we are
reviewing in this patch set. Is n't it?

thanks,
suresh

2007-06-13 18:40:56

by Matt Mackall

[permalink] [raw]

Subject: Re: [Intel-IOMMU 02/10] Library routine for pre-allocat pool handling

On Mon, Jun 11, 2007 at 06:55:46PM -0700, Arjan van de Ven wrote:
> Andrew Morton wrote:
> >On Mon, 11 Jun 2007 18:10:40 -0700 Arjan van de Ven
> ><[email protected]> wrote:
> >
> >>Andrew Morton wrote:
> >>>>Where as resource pool is exactly opposite of mempool, where each
> >>>>time it looks for an object in the pool and if it exist then we
> >>>>return that object else we try to get the memory for OS while
> >>>>scheduling the work to grow the pool objects. In fact, the work
> >>>>is schedule to grow the pool when the low threshold point is hit.
> >>>I realise all that. But I'd have thought that the mempool approach is
> >>>actually better: use the page allocator and only deplete your reserve
> >>>pool
> >>>when the page allocator fails.
> >>the problem with that is that if anything downstream from the iommu
> >>layer ALSO needs memory, we've now eaten up the last free page and
> >>things go splat.
> >
> >If that happens, we still have the mempool reserve to fall back to.
>
> we do, except that we just ate the memory the downstream code would
> use and get ... so THAT can't get any.

Then the downstream ought to be using a mempool?

--
Mathematics is the supreme nostalgia of our time.

2007-06-13 19:04:32