2012-10-01 14:03:21

by Alexander Graf

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC


On 28.09.2012, at 17:10, J. Bruce Fields wrote:

> On Fri, Sep 28, 2012 at 04:19:55AM +0200, Alexander Graf wrote:
>>
>> On 28.09.2012, at 04:04, Linus Torvalds wrote:
>>
>>> On Thu, Sep 27, 2012 at 6:55 PM, Alexander Graf <[email protected]> wrote:
>>>>
>>>> Below are OOPS excerpts from different rc's I tried. All of them crashed - all the way up to current Linus' master branch. I haven't cross-checked, but I don't remember any such behavior from pre-3.6 releases.
>>>
>>> Since you seem to be able to reproduce it easily (and apparently
>>> reliably), any chance you could just bisect it?
>>>
>>> Since I assume v3.5 is fine, and apparently -rc1 is already busted, a simple
>>>
>>> git bisect start
>>> git bisect good v3.5
>>> git bisect bad v3.6-rc1
>>>
>>> will get you started on your adventure..
>>
>> Heh, will give it a try :). The thing really does look quite bisectable.
>>
>>
>> It might take a few hours though - the machine isn't exactly fast by today's standards and it's getting late here. But I'll keep you updated.
>
> I doubt it's anything special about that workload, but just for kicks I
> tried a "git clone -ls" (cloning my linux tree to another directory on
> the same nfs filesystem), with server on 3.6.0-rc7, and didn't see
> anything interesting (just an xfs lockdep warning that looks like this
> one jlayton already reported:
> http://oss.sgi.com/archives/xfs/2012-09/msg00088.html
> )
>
> Any (even partial) bisection results would certainly be useful, thanks.

Phew. Here we go :). It looks to be more of a PPC-specific problem than it appeared at first:


b4c3a8729ae57b4f84d661e16a192f828eca1d03 is first bad commit
commit b4c3a8729ae57b4f84d661e16a192f828eca1d03
Author: Anton Blanchard <[email protected]>
Date: Thu Jun 7 18:14:48 2012 +0000

powerpc/iommu: Implement IOMMU pools to improve multiqueue adapter performance

At the moment all queues in a multiqueue adapter will serialise
against the IOMMU table lock. This is proving to be a big issue,
especially with 10Gbit ethernet.

This patch creates 4 pools and tries to spread the load across
them. If the table is under 1GB in size we revert back to the
original behaviour of 1 pool and 1 largealloc pool.

We create a hash to map CPUs to pools. Since we prefer interrupts to
be affinitised to primary CPUs, without some form of hashing we are
very likely to end up using the same pool. As an example, POWER7
has 4 way SMT and with 4 pools all primary threads will map to the
same pool.

The largealloc pool is reduced from 1/2 to 1/4 of the space to
partially offset the overhead of breaking the table up into pools.

Some performance numbers were obtained with a Chelsio T3 adapter on
two POWER7 boxes, running a 100 session TCP round robin test.

Performance improved 69% with this patch applied.

Signed-off-by: Anton Blanchard <[email protected]>
Signed-off-by: Benjamin Herrenschmidt <[email protected]>

:040000 040000 039ae3cbdcfded9c6b13e58a3fc67609f1b587b0 6755a8c4a690cc80dcf834d1127f21db925476d6 M arch


Alex


2012-10-01 15:21:31

by J. Bruce Fields

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC

On Mon, Oct 01, 2012 at 04:03:06PM +0200, Alexander Graf wrote:
>
> On 28.09.2012, at 17:10, J. Bruce Fields wrote:
>
> > On Fri, Sep 28, 2012 at 04:19:55AM +0200, Alexander Graf wrote:
> >>
> >> On 28.09.2012, at 04:04, Linus Torvalds wrote:
> >>
> >>> On Thu, Sep 27, 2012 at 6:55 PM, Alexander Graf <[email protected]> wrote:
> >>>>
> >>>> Below are OOPS excerpts from different rc's I tried. All of them crashed - all the way up to current Linus' master branch. I haven't cross-checked, but I don't remember any such behavior from pre-3.6 releases.
> >>>
> >>> Since you seem to be able to reproduce it easily (and apparently
> >>> reliably), any chance you could just bisect it?
> >>>
> >>> Since I assume v3.5 is fine, and apparently -rc1 is already busted, a simple
> >>>
> >>> git bisect start
> >>> git bisect good v3.5
> >>> git bisect bad v3.6-rc1
> >>>
> >>> will get you started on your adventure..
> >>
> >> Heh, will give it a try :). The thing really does look quite bisectable.
> >>
> >>
> >> It might take a few hours though - the machine isn't exactly fast by today's standards and it's getting late here. But I'll keep you updated.
> >
> > I doubt it's anything special about that workload, but just for kicks I
> > tried a "git clone -ls" (cloning my linux tree to another directory on
> > the same nfs filesystem), with server on 3.6.0-rc7, and didn't see
> > anything interesting (just an xfs lockdep warning that looks like this
> > one jlayton already reported:
> > http://oss.sgi.com/archives/xfs/2012-09/msg00088.html
> > )
> >
> > Any (even partial) bisection results would certainly be useful, thanks.
>
> Phew. Here we go :). It looks to be more of a PPC specific problem than it appeared as at first:

Yep, thanks--I'll assume this is somebody else's problem until somebody
tells me otherwise!

--b.

>
>
> b4c3a8729ae57b4f84d661e16a192f828eca1d03 is first bad commit
> commit b4c3a8729ae57b4f84d661e16a192f828eca1d03
> Author: Anton Blanchard <[email protected]>
> Date: Thu Jun 7 18:14:48 2012 +0000
>
> powerpc/iommu: Implement IOMMU pools to improve multiqueue adapter performance
>
> At the moment all queues in a multiqueue adapter will serialise
> against the IOMMU table lock. This is proving to be a big issue,
> especially with 10Gbit ethernet.
>
> This patch creates 4 pools and tries to spread the load across
> them. If the table is under 1GB in size we revert back to the
> original behaviour of 1 pool and 1 largealloc pool.
>
> We create a hash to map CPUs to pools. Since we prefer interrupts to
> be affinitised to primary CPUs, without some form of hashing we are
> very likely to end up using the same pool. As an example, POWER7
> has 4 way SMT and with 4 pools all primary threads will map to the
> same pool.
>
> The largealloc pool is reduced from 1/2 to 1/4 of the space to
> partially offset the overhead of breaking the table up into pools.
>
> Some performance numbers were obtained with a Chelsio T3 adapter on
> two POWER7 boxes, running a 100 session TCP round robin test.
>
> Performance improved 69% with this patch applied.
>
> Signed-off-by: Anton Blanchard <[email protected]>
> Signed-off-by: Benjamin Herrenschmidt <[email protected]>
>
> :040000 040000 039ae3cbdcfded9c6b13e58a3fc67609f1b587b0 6755a8c4a690cc80dcf834d1127f21db925476d6 M arch
>
>
> Alex
>

2012-10-02 00:59:59

by Benjamin Herrenschmidt

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC

On Mon, 2012-10-01 at 16:03 +0200, Alexander Graf wrote:
> Phew. Here we go :). It looks to be more of a PPC specific problem than it appeared as at first:

Ok, so I suspect the problem is the pushing down of the locks, which
breaks with IOMMU backends that have a separate flush callback: in
that case, the flush moves out of the allocator lock.

We do still call flush before we return, but I suspect it becomes
racy; somebody needs to give it a closer look. I'm hoping Anton or
Nish will later today.

Cheers,
Ben.

>
> b4c3a8729ae57b4f84d661e16a192f828eca1d03 is first bad commit
> commit b4c3a8729ae57b4f84d661e16a192f828eca1d03
> Author: Anton Blanchard <[email protected]>
> Date: Thu Jun 7 18:14:48 2012 +0000
>
> powerpc/iommu: Implement IOMMU pools to improve multiqueue adapter performance
>
> At the moment all queues in a multiqueue adapter will serialise
> against the IOMMU table lock. This is proving to be a big issue,
> especially with 10Gbit ethernet.
>
> This patch creates 4 pools and tries to spread the load across
> them. If the table is under 1GB in size we revert back to the
> original behaviour of 1 pool and 1 largealloc pool.
>
> We create a hash to map CPUs to pools. Since we prefer interrupts to
> be affinitised to primary CPUs, without some form of hashing we are
> very likely to end up using the same pool. As an example, POWER7
> has 4 way SMT and with 4 pools all primary threads will map to the
> same pool.
>
> The largealloc pool is reduced from 1/2 to 1/4 of the space to
> partially offset the overhead of breaking the table up into pools.
>
> Some performance numbers were obtained with a Chelsio T3 adapter on
> two POWER7 boxes, running a 100 session TCP round robin test.
>
> Performance improved 69% with this patch applied.
>
> Signed-off-by: Anton Blanchard <[email protected]>
> Signed-off-by: Benjamin Herrenschmidt <[email protected]>
>
> :040000 040000 039ae3cbdcfded9c6b13e58a3fc67609f1b587b0 6755a8c4a690cc80dcf834d1127f21db925476d6 M arch
>
>
> Alex

2012-10-02 21:43:41

by Nishanth Aravamudan

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC

Hi Ben,

On 02.10.2012 [10:58:29 +1000], Benjamin Herrenschmidt wrote:
> On Mon, 2012-10-01 at 16:03 +0200, Alexander Graf wrote:
> > Phew. Here we go :). It looks to be more of a PPC specific problem
> > than it appeared as at first:
>
> Ok, so I suspect the problem is the pushing down of the locks which
> breaks with iommu backends that have a separate flush callback. In
> that case, the flush moves out of the allocator lock.
>
> Now we do call flush before we return, still, but it becomes racy
> I suspect, but somebody needs to give it a closer look. I'm hoping
> Anton or Nish will later today.

Started looking into this. If your suspicion were accurate, wouldn't the
bisection have stopped at 0e4bc95d87394364f408627067238453830bdbf3
("powerpc/iommu: Reduce spinlock coverage in iommu_alloc and
iommu_free")?

Alex, the error is reproducible, right? Does it go away by reverting
that commit against mainline? Just trying to narrow down my focus.

Thanks,
Nish

2012-10-02 21:47:54

by Alexander Graf

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC


On 02.10.2012, at 23:43, Nishanth Aravamudan wrote:

> Hi Ben,
>
> On 02.10.2012 [10:58:29 +1000], Benjamin Herrenschmidt wrote:
>> On Mon, 2012-10-01 at 16:03 +0200, Alexander Graf wrote:
>>> Phew. Here we go :). It looks to be more of a PPC specific problem
>>> than it appeared as at first:
>>
>> Ok, so I suspect the problem is the pushing down of the locks which
>> breaks with iommu backends that have a separate flush callback. In
>> that case, the flush moves out of the allocator lock.
>>
>> Now we do call flush before we return, still, but it becomes racy
>> I suspect, but somebody needs to give it a closer look. I'm hoping
>> Anton or Nish will later today.
>
> Started looking into this. If your suspicion were accurate, wouldn't the
> bisection have stopped at 0e4bc95d87394364f408627067238453830bdbf3
> ("powerpc/iommu: Reduce spinlock coverage in iommu_alloc and
> iommu_free")?
>
> Alex, the error is reproducible, right?

Yes. I'm having a hard time figuring out whether the reason my U4-based G5 Mac crashes and fails to read data is the same, since I don't have a serial connection there, but I assume so.

> Does it go away by reverting
> that commit against mainline? Just trying to narrow down my focus.

The patch doesn't revert that easily. Mind providing a revert patch so I can try?


Alex

2012-10-02 22:17:50

by Nishanth Aravamudan

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC

On 02.10.2012 [23:47:39 +0200], Alexander Graf wrote:
>
> On 02.10.2012, at 23:43, Nishanth Aravamudan wrote:
>
> > Hi Ben,
> >
> > On 02.10.2012 [10:58:29 +1000], Benjamin Herrenschmidt wrote:
> >> On Mon, 2012-10-01 at 16:03 +0200, Alexander Graf wrote:
> >>> Phew. Here we go :). It looks to be more of a PPC specific problem
> >>> than it appeared as at first:
> >>
> >> Ok, so I suspect the problem is the pushing down of the locks which
> >> breaks with iommu backends that have a separate flush callback. In
> >> that case, the flush moves out of the allocator lock.
> >>
> >> Now we do call flush before we return, still, but it becomes racy
> >> I suspect, but somebody needs to give it a closer look. I'm hoping
> >> Anton or Nish will later today.
> >
> > Started looking into this. If your suspicion were accurate, wouldn't the
> > bisection have stopped at 0e4bc95d87394364f408627067238453830bdbf3
> > ("powerpc/iommu: Reduce spinlock coverage in iommu_alloc and
> > iommu_free")?
> >
> > Alex, the error is reproducible, right?
>
> Yes. I'm having a hard time to figure out if the reason my U4 based G5
> Mac crashes and fails reading data is the same since I don't have a
> serial connection there, but I assume so.

Ok, great, thanks. Yeah, that would imply (I think) that the lock
pushdown in the above commit (or even in one of the others in Anton's
series) would have been the real source if it was a lock-based race.
But that's just my first sniff at what Ben was suggesting. Still
reading/understanding the code.

> > Does it go away by reverting
> > that commit against mainline? Just trying to narrow down my focus.
>
> The patch doesn't revert that easily. Mind to provide a revert patch
> so I can try?

The following at least builds on defconfig here:

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cbfe678..957a83f 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -53,16 +53,6 @@ static __inline__ __attribute_const__ int get_iommu_order(unsigned long size)
*/
#define IOMAP_MAX_ORDER 13

-#define IOMMU_POOL_HASHBITS 2
-#define IOMMU_NR_POOLS (1 << IOMMU_POOL_HASHBITS)
-
-struct iommu_pool {
- unsigned long start;
- unsigned long end;
- unsigned long hint;
- spinlock_t lock;
-} ____cacheline_aligned_in_smp;
-
struct iommu_table {
unsigned long it_busno; /* Bus number this table belongs to */
unsigned long it_size; /* Size of iommu table in entries */
@@ -71,10 +61,10 @@ struct iommu_table {
unsigned long it_index; /* which iommu table this is */
unsigned long it_type; /* type: PCI or Virtual Bus */
unsigned long it_blocksize; /* Entries in each block (cacheline) */
- unsigned long poolsize;
- unsigned long nr_pools;
- struct iommu_pool large_pool;
- struct iommu_pool pools[IOMMU_NR_POOLS];
+ unsigned long it_hint; /* Hint for next alloc */
+ unsigned long it_largehint; /* Hint for large allocs */
+ unsigned long it_halfpoint; /* Breaking point for small/large allocs */
+ spinlock_t it_lock; /* Protects it_map */
unsigned long *it_map; /* A simple allocation bitmap for now */
};

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ff5a6ce..9a31f3c 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -62,26 +62,6 @@ static int __init setup_iommu(char *str)

__setup("iommu=", setup_iommu);

-static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
-
-/*
- * We precalculate the hash to avoid doing it on every allocation.
- *
- * The hash is important to spread CPUs across all the pools. For example,
- * on a POWER7 with 4 way SMT we want interrupts on the primary threads and
- * with 4 pools all primary threads would map to the same pool.
- */
-static int __init setup_iommu_pool_hash(void)
-{
- unsigned int i;
-
- for_each_possible_cpu(i)
- per_cpu(iommu_pool_hash, i) = hash_32(i, IOMMU_POOL_HASHBITS);
-
- return 0;
-}
-subsys_initcall(setup_iommu_pool_hash);
-
#ifdef CONFIG_FAIL_IOMMU

static DECLARE_FAULT_ATTR(fail_iommu);
@@ -184,8 +164,6 @@ static unsigned long iommu_range_alloc(struct device *dev,
unsigned long align_mask;
unsigned long boundary_size;
unsigned long flags;
- unsigned int pool_nr;
- struct iommu_pool *pool;

align_mask = 0xffffffffffffffffl >> (64 - align_order);

@@ -201,46 +179,38 @@ static unsigned long iommu_range_alloc(struct device *dev,
if (should_fail_iommu(dev))
return DMA_ERROR_CODE;

- /*
- * We don't need to disable preemption here because any CPU can
- * safely use any IOMMU pool.
- */
- pool_nr = __raw_get_cpu_var(iommu_pool_hash) & (tbl->nr_pools - 1);
-
- if (largealloc)
- pool = &(tbl->large_pool);
- else
- pool = &(tbl->pools[pool_nr]);
-
- spin_lock_irqsave(&(pool->lock), flags);
+ spin_lock_irqsave(&(tbl->it_lock), flags);

-again:
- if ((pass == 0) && handle && *handle)
+ if (handle && *handle)
start = *handle;
else
- start = pool->hint;
+ start = largealloc ? tbl->it_largehint : tbl->it_hint;

- limit = pool->end;
+ /* Use only half of the table for small allocs (15 pages or less) */
+ limit = largealloc ? tbl->it_size : tbl->it_halfpoint;
+
+ if (largealloc && start < tbl->it_halfpoint)
+ start = tbl->it_halfpoint;

/* The case below can happen if we have a small segment appended
* to a large, or when the previous alloc was at the very end of
* the available space. If so, go back to the initial start.
*/
if (start >= limit)
- start = pool->start;
+ start = largealloc ? tbl->it_largehint : tbl->it_hint;
+
+ again:

if (limit + tbl->it_offset > mask) {
limit = mask - tbl->it_offset + 1;
/* If we're constrained on address range, first try
* at the masked hint to avoid O(n) search complexity,
- * but on second pass, start at 0 in pool 0.
+ * but on second pass, start at 0.
*/
- if ((start & mask) >= limit || pass > 0) {
- pool = &(tbl->pools[0]);
- start = pool->start;
- } else {
+ if ((start & mask) >= limit || pass > 0)
+ start = 0;
+ else
start &= mask;
- }
}

if (dev)
@@ -254,25 +224,17 @@ again:
tbl->it_offset, boundary_size >> IOMMU_PAGE_SHIFT,
align_mask);
if (n == -1) {
- if (likely(pass == 0)) {
- /* First try the pool from the start */
- pool->hint = pool->start;
- pass++;
- goto again;
-
- } else if (pass <= tbl->nr_pools) {
- /* Now try scanning all the other pools */
- spin_unlock(&(pool->lock));
- pool_nr = (pool_nr + 1) & (tbl->nr_pools - 1);
- pool = &tbl->pools[pool_nr];
- spin_lock(&(pool->lock));
- pool->hint = pool->start;
+ if (likely(pass < 2)) {
+ /* First failure, just rescan the half of the table.
+ * Second failure, rescan the other half of the table.
+ */
+ start = (largealloc ^ pass) ? tbl->it_halfpoint : 0;
+ limit = pass ? tbl->it_size : limit;
pass++;
goto again;
-
} else {
- /* Give up */
- spin_unlock_irqrestore(&(pool->lock), flags);
+ /* Third failure, give up */
+ spin_unlock_irqrestore(&(tbl->it_lock), flags);
return DMA_ERROR_CODE;
}
}
@@ -282,10 +244,10 @@ again:
/* Bump the hint to a new block for small allocs. */
if (largealloc) {
/* Don't bump to new block to avoid fragmentation */
- pool->hint = end;
+ tbl->it_largehint = end;
} else {
/* Overflow will be taken care of at the next allocation */
- pool->hint = (end + tbl->it_blocksize - 1) &
+ tbl->it_hint = (end + tbl->it_blocksize - 1) &
~(tbl->it_blocksize - 1);
}

@@ -293,8 +255,7 @@ again:
if (handle)
*handle = end;

- spin_unlock_irqrestore(&(pool->lock), flags);
-
+ spin_unlock_irqrestore(&(tbl->it_lock), flags);
return n;
}

@@ -369,45 +330,23 @@ static bool iommu_free_check(struct iommu_table *tbl, dma_addr_t dma_addr,
return true;
}

-static struct iommu_pool *get_pool(struct iommu_table *tbl,
- unsigned long entry)
-{
- struct iommu_pool *p;
- unsigned long largepool_start = tbl->large_pool.start;
-
- /* The large pool is the last pool at the top of the table */
- if (entry >= largepool_start) {
- p = &tbl->large_pool;
- } else {
- unsigned int pool_nr = entry / tbl->poolsize;
-
- BUG_ON(pool_nr > tbl->nr_pools);
- p = &tbl->pools[pool_nr];
- }
-
- return p;
-}
-
static void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
unsigned int npages)
{
unsigned long entry, free_entry;
unsigned long flags;
- struct iommu_pool *pool;

entry = dma_addr >> IOMMU_PAGE_SHIFT;
free_entry = entry - tbl->it_offset;

- pool = get_pool(tbl, free_entry);
-
if (!iommu_free_check(tbl, dma_addr, npages))
return;

ppc_md.tce_free(tbl, entry, npages);

- spin_lock_irqsave(&(pool->lock), flags);
+ spin_lock_irqsave(&(tbl->it_lock), flags);
bitmap_clear(tbl->it_map, free_entry, npages);
- spin_unlock_irqrestore(&(pool->lock), flags);
+ spin_unlock_irqrestore(&(tbl->it_lock), flags);
}

static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
@@ -649,8 +588,9 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
unsigned long sz;
static int welcomed = 0;
struct page *page;
- unsigned int i;
- struct iommu_pool *p;
+
+ /* Set aside 1/4 of the table for large allocations. */
+ tbl->it_halfpoint = tbl->it_size * 3 / 4;

/* number of bytes needed for the bitmap */
sz = (tbl->it_size + 7) >> 3;
@@ -669,28 +609,9 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);

- /* We only split the IOMMU table if we have 1GB or more of space */
- if ((tbl->it_size << IOMMU_PAGE_SHIFT) >= (1UL * 1024 * 1024 * 1024))
- tbl->nr_pools = IOMMU_NR_POOLS;
- else
- tbl->nr_pools = 1;
-
- /* We reserve the top 1/4 of the table for large allocations */
- tbl->poolsize = (tbl->it_size * 3 / 4) / tbl->nr_pools;
-
- for (i = 0; i < tbl->nr_pools; i++) {
- p = &tbl->pools[i];
- spin_lock_init(&(p->lock));
- p->start = tbl->poolsize * i;
- p->hint = p->start;
- p->end = p->start + tbl->poolsize;
- }
-
- p = &tbl->large_pool;
- spin_lock_init(&(p->lock));
- p->start = tbl->poolsize * i;
- p->hint = p->start;
- p->end = tbl->it_size;
+ tbl->it_hint = 0;
+ tbl->it_largehint = tbl->it_halfpoint;
+ spin_lock_init(&tbl->it_lock);

iommu_table_clear(tbl);

diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index dca2136..b673200 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -518,6 +518,7 @@ cell_iommu_setup_window(struct cbe_iommu *iommu, struct device_node *np,
__set_bit(0, window->table.it_map);
tce_build_cell(&window->table, window->table.it_offset, 1,
(unsigned long)iommu->pad_page, DMA_TO_DEVICE, NULL);
+ window->table.it_hint = window->table.it_blocksize;

return window;
}

2012-10-02 22:31:18

by Alexander Graf

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC


On 03.10.2012, at 00:17, Nishanth Aravamudan wrote:

> On 02.10.2012 [23:47:39 +0200], Alexander Graf wrote:
>>
>> On 02.10.2012, at 23:43, Nishanth Aravamudan wrote:
>>
>>> Hi Ben,
>>>
>>> On 02.10.2012 [10:58:29 +1000], Benjamin Herrenschmidt wrote:
>>>> On Mon, 2012-10-01 at 16:03 +0200, Alexander Graf wrote:
>>>>> Phew. Here we go :). It looks to be more of a PPC specific problem
>>>>> than it appeared as at first:
>>>>
>>>> Ok, so I suspect the problem is the pushing down of the locks which
>>>> breaks with iommu backends that have a separate flush callback. In
>>>> that case, the flush moves out of the allocator lock.
>>>>
>>>> Now we do call flush before we return, still, but it becomes racy
>>>> I suspect, but somebody needs to give it a closer look. I'm hoping
>>>> Anton or Nish will later today.
>>>
>>> Started looking into this. If your suspicion were accurate, wouldn't the
>>> bisection have stopped at 0e4bc95d87394364f408627067238453830bdbf3
>>> ("powerpc/iommu: Reduce spinlock coverage in iommu_alloc and
>>> iommu_free")?
>>>
>>> Alex, the error is reproducible, right?
>>
>> Yes. I'm having a hard time to figure out if the reason my U4 based G5
>> Mac crashes and fails reading data is the same since I don't have a
>> serial connection there, but I assume so.
>
> Ok, great, thanks. Yeah, that would imply (I think) that the I would
> have thought the lock pushdown in the above commit (or even in one of
> the others in Anton's series) would have been the real source if it was
> a lock-based race. But that's just my first sniff at what Ben was
> suggesting. Still reading/understanding the code.
>
>>> Does it go away by reverting
>>> that commit against mainline? Just trying to narrow down my focus.
>>
>> The patch doesn't revert that easily. Mind to provide a revert patch
>> so I can try?
>
> The following at least builds on defconfig here:

Yes. With that patch applied, things work for me again.


Alex

2012-10-03 04:23:07

by Benjamin Herrenschmidt

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC

On Tue, 2012-10-02 at 14:43 -0700, Nishanth Aravamudan wrote:
>
> Started looking into this. If your suspicion were accurate, wouldn't
> the
> bisection have stopped at 0e4bc95d87394364f408627067238453830bdbf3
> ("powerpc/iommu: Reduce spinlock coverage in iommu_alloc and
> iommu_free")?
>
> Alex, the error is reproducible, right? Does it go away by reverting
> that commit against mainline? Just trying to narrow down my focus.

My suspicion is, I'm afraid, a real bug, but not that bug, since it would
only happen on U3 and this is a U4 machine ... so we have two bugs, one
of them still unidentified.

Cleaning up the CC list for now...

Cheers,
Ben.

2012-10-04 00:26:15

by Anton Blanchard

Subject: Re: [REGRESSION] nfsd crashing with 3.6.0-rc7 on PowerPC


Hi,

> Yes. With that patch applied, things work for me again.

Thanks Alex, Nish. We can reproduce this on one of our Biminis, looking
into it now.

Anton

2012-10-04 04:57:24

by Anton Blanchard

Subject: [PATCH] powerpc/iommu: Fix multiple issues with IOMMU pools code


Hi Alex,

Looks to be a preempt issue with the iommu pools code. I did find a
couple more bugs along the way too.

Anton
--

There are a number of issues in the recent IOMMU pools code:

- On a preempt kernel we might switch CPUs in the middle of building
a scatter gather list. When this happens the handle hint passed in
no longer falls within the local CPU's pool. Check for this and
fall back to the pool hint.

- We were missing a spin_unlock/spin_lock in one spot where we
switch pools.

- We need to provide locking around dart_tlb_invalidate_all and
dart_tlb_invalidate_one now that the global lock is gone.

Reported-by: Alexander Graf <[email protected]>
Signed-off-by: Anton Blanchard <[email protected]>
---

There is still an issue with the lazy u3 flushing, but I wanted
to get this out for testing.

Index: b/arch/powerpc/kernel/iommu.c
===================================================================
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -215,7 +215,8 @@ static unsigned long iommu_range_alloc(s
spin_lock_irqsave(&(pool->lock), flags);

again:
- if ((pass == 0) && handle && *handle)
+ if ((pass == 0) && handle && *handle &&
+ (*handle >= pool->start) && (*handle < pool->end))
start = *handle;
else
start = pool->hint;
@@ -236,7 +237,9 @@ again:
* but on second pass, start at 0 in pool 0.
*/
if ((start & mask) >= limit || pass > 0) {
+ spin_unlock(&(pool->lock));
pool = &(tbl->pools[0]);
+ spin_lock(&(pool->lock));
start = pool->start;
} else {
start &= mask;
Index: b/arch/powerpc/sysdev/dart_iommu.c
===================================================================
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -74,11 +74,16 @@ static int dart_is_u4;

#define DBG(...)

+static DEFINE_SPINLOCK(invalidate_lock);
+
static inline void dart_tlb_invalidate_all(void)
{
unsigned long l = 0;
unsigned int reg, inv_bit;
unsigned long limit;
+ unsigned long flags;
+
+ spin_lock_irqsave(&invalidate_lock, flags);

DBG("dart: flush\n");

@@ -111,12 +116,17 @@ retry:
panic("DART: TLB did not flush after waiting a long "
"time. Buggy U3 ?");
}
+
+ spin_unlock_irqrestore(&invalidate_lock, flags);
}

static inline void dart_tlb_invalidate_one(unsigned long bus_rpn)
{
unsigned int reg;
unsigned int l, limit;
+ unsigned long flags;
+
+ spin_lock_irqsave(&invalidate_lock, flags);

reg = DART_CNTL_U4_ENABLE | DART_CNTL_U4_IONE |
(bus_rpn & DART_CNTL_U4_IONE_MASK);
@@ -138,6 +148,8 @@ wait_more:
panic("DART: TLB did not flush after waiting a long "
"time. Buggy U4 ?");
}
+
+ spin_unlock_irqrestore(&invalidate_lock, flags);
}

static void dart_flush(struct iommu_table *tbl)

2012-10-04 10:54:59

by Alexander Graf

Subject: Re: [PATCH] powerpc/iommu: Fix multiple issues with IOMMU pools code

Hi Anton,

On 04.10.2012, at 06:57, Anton Blanchard wrote:

>
> Hi Alex,
>
> Looks to be a preempt issue with the iommu pools code. I did find a
> couple more bugs along the way too.
>
> Anton
> --
>
> There are a number of issues in the recent IOMMU pools code:
>
> - On a preempt kernel we might switch CPUs in the middle of building
> a scatter gather list. When this happens the handle hint passed in
> no longer falls within the local CPU's pool. Check for this and
> fall back to the pool hint.
>
> - We were missing a spin_unlock/spin_lock in one spot where we
> switch pools.
>
> - We need to provide locking around dart_tlb_invalidate_all and
> dart_tlb_invalidate_one now that the global lock is gone.
>
> Reported-by: Alexander Graf <[email protected]>
> Signed-off-by: Anton Blanchard <[email protected]>
> ---
>
> There is still an issue with the lazy u3 flushing, but I wanted
> to get this out for testing.

Yup. It fixes the NFS problem on my U4-based machine.

Tested-by: Alexander Graf <[email protected]>

Alex