2006-01-27 16:57:54

by Mark Haverkamp

Subject: iommu_alloc failure and panic

While running a disk test (bonnie++) I have been seeing iommu_alloc
failed messages in the syslog leading to a panic. The machine that I
have is an IBM OpenPower 720. It has 7856808 kB of memory and two
dual-core POWER5 CPUs.

cpu : POWER5 (gr)
clock : 1656.384000MHz
revision : 2.1 (pvr 003a 0201)

I have been testing large numbers of storage devices. I have 16000 SCSI
LUNs configured (4000 Fibre Channel LUNs seen 4 times). They are
configured as 4000 multipath devices using device mapper. I have 100 of
those devices configured as a logical volume using LVM2. Once I start
bonnie++ using one of those logical volumes I start seeing iommu_alloc
failures and then a panic. I don't have this issue when accessing the
individual devices or the individual multipath devices, only when they
are conglomerated into a logical volume.


Here is the syslog output:

Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000d7967000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000011ad80000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000e3bca000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000011ad80000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000011733a000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000117c58000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000011711d000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000126292000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000011ad80000 npages 10
Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000d4728000 npages 10
Jan 26 14:56:46 linux kernel: printk: 830 messages suppressed.
Jan 26 14:56:46 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000012adaa000 npages 10
Jan 26 14:56:51 linux kernel: printk: 818 messages suppressed.
Jan 26 14:56:51 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000001451eb000 npages 10
Jan 26 14:56:56 linux kernel: printk: 1027 messages suppressed.
Jan 26 14:56:56 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000009c365000 npages 10
Jan 26 14:57:01 linux kernel: printk: 810 messages suppressed.
Jan 26 14:57:01 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000152606000 npages 10
Jan 26 14:57:06 linux kernel: printk: 819 messages suppressed.
Jan 26 14:57:06 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000134f3a000 npages 10
Jan 26 14:57:11 linux kernel: printk: 679 messages suppressed.
Jan 26 14:57:11 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000167030000 npages 10

Here is the panic output cut from the console:

Welcome to SUSE LINUX 10.0 (PPC) - Kernel 2.6.16-rc1-ppc64 (hvc0).


dev2-002 login: Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=128 NUMA PSERIES LPAR
Modules linked in: dm_round_robin lpfc scsi_transport_fc dm_multipath ipv6 parport_pc lp parport dm_mod sg st ipr firmware_class sd_mod scsi_mod
NIP: C00000000000F7D0 LR: C00000000000FA2C CTR: 0000000000000000
REGS: c0000000076db940 TRAP: 0300 Not tainted (2.6.16-rc1-ppc64)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 24002048 XER: 00000010
DAR: C000000600711710, DSISR: 0000000040010000
TASK = c0000000076e4040[191] 'kblockd/3' THREAD: c0000000076d8000 CPU: 3
GPR00: 0000000600000000 C0000000076DBBC0 C00000000070EE68 C0000000076DBDE0
GPR04: C0000000076E4040 00000000000002F0 0000000000080000 C00000000058C720
GPR08: 0000000000000001 C0000000076DBD10 C000000000711710 C0000000076D8000
GPR12: C0000000076DB720 C00000000055F100 0000000000000000 0000000000000000
GPR16: 0000000000000000 000000000197FD90 4000000001C00000 000000000197FD88
GPR20: 000000000197FD98 C000000000503290 0000000000000060 0000000002103728
GPR24: C00000000055EB00 0000000000000001 0000000000000000 C0000000076DBEE0
GPR28: 0000000000000002 C0000000076DBDE0 C0000000076E4040 C000000000079FC0
NIP [C00000000000F7D0] .validate_sp+0x30/0xc0
LR [C00000000000FA2C] .show_stack+0xec/0x1d0
Call Trace:
[C0000000076DBBC0] [C00000000000FA18] .show_stack+0xd8/0x1d0 (unreliable)
[C0000000076DBC60] [C000000000433838] .schedule+0xd78/0xfb0
[C0000000076DBDE0] [C000000000079FC0] .worker_thread+0x1b0/0x1c0
Unable to handle kernel paging request for data at address 0xc000000600711710
Faulting instruction address: 0xc00000000000f7d0
Unable to handle kernel paging request for data at address 0xc00001800055eb30
Faulting instruction address: 0xc00000000005967c
Badness in __mutex_unlock_slowpath at kernel/mutex.c:209
Call Trace:
[C00000011E4FF770] [C00000000000F9A8] .show_stack+0x68/0x1d0 (unreliable)
[C00000011E4FF810] [C000000000436C00] .program_check_exception+0x2f0/0x600
[C00000011E4FF8B0] [C000000000004348] program_check_common+0xc8/0x100
--- Exception: 700 at .__mutex_unlock_slowpath+0xd4/0x250
LR = .generic_file_write+0x6c/0x130
[C00000011E4FFBA0] [C00000011E4FFBB0] 0xc00000011e4ffbb0 (unreliable)
[C00000011E4FFC30] [C0000000000A03DC] .generic_file_write+0x6c/0x130
[C00000011E4FFCF0] [C0000000000D380C] .vfs_write+0x14c/0x210
[C00000011E4FFD90] [C0000000000D44BC] .sys_write+0x4c/0x90
Unable to handle kernel paging request for data at address 0xc000000600711718
Faulting instruction address: 0xc00000000000f7d0

My kernel is 2.6.16-rc1 from the git tree yesterday (01/26/06). It does
the same thing with 2.6.15 though. I have attached my .config



--
Mark Haverkamp <[email protected]>


Attachments:
config_ppc64 (39.15 kB)

2006-01-27 20:40:49

by Olof Johansson

Subject: Re: iommu_alloc failure and panic

Hi,

On Fri, Jan 27, 2006 at 08:57:40AM -0800, Mark Haverkamp wrote:
> While running a disk test (bonnie++) I have been seeing iommu_alloc
> failed messages in the syslog leading to a panic. The machine that I

Hmm. The IOMMU code tries to be clever and not write on a cacheline
that's already in use, to avoid invalidating cached entries on the I/O
bridges.

Since a cache line is 128 bytes (16 entries), this means it advances
that much on each allocation, and tries to allocate a new entry in the
next block of 16. Essentially it is fragmenting the allocation space on
purpose.

So far I haven't seen anyone else have problems with this. I'm
suspecting that the SCSI probe code might map something per disk (or
similar), such that there are a lot of small allocations being done, each
using up part of a line. Once the end of the allocation space is
reached, the allocator wraps to the beginning and starts walking again.

Since this greatly reduces the chance of allocating anything 16 pages or
more, a part of the table (25%) is set aside and used for the large
allocations. If an allocation in one section of the table (large/small)
doesn't succeed, the other half is also searched.

Each allocation in your report is 10 entries, so that means that at
least 7 must be taken on each line, since the allocation won't go to the
large area by default.
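
If it helps to visualise the effect, here is a throwaway user-space model of
that stride behaviour (purely illustrative: real table and line sizes, but it
ignores the large-allocation half-point and everything else the real
allocator does):

/* toy_iommu.c - toy model of the block-advancing TCE allocator */
#include <stdio.h>
#include <string.h>

#define TABLE 0x10000            /* entries in the table */
#define LINE  16                 /* entries per 128-byte cache line */

static unsigned char map[TABLE]; /* 1 = entry in use */
static unsigned long hint;       /* where the next search starts */

/* first-fit allocation of npages contiguous entries, starting at the hint */
static long alloc(int npages)
{
    for (unsigned long t = 0; t < TABLE; t++) {
        unsigned long n = (hint + t) % TABLE;
        int ok = (n + npages <= TABLE);
        for (int i = 0; ok && i < npages; i++)
            ok = !map[n + i];
        if (ok) {
            memset(map + n, 1, npages);
            /* advance the hint to the start of the next cache line */
            hint = ((n + npages + LINE - 1) & ~(unsigned long)(LINE - 1)) % TABLE;
            return n;
        }
    }
    return -1;                   /* iommu_alloc failed */
}

int main(void)
{
    /* lots of small (one page) mappings: each one dirties a fresh line */
    for (int i = 0; i < TABLE / LINE; i++)
        alloc(1);

    /* every line now has one entry used, so a 16-page mapping
     * (npages 0x10, as in the log) can no longer be satisfied */
    printf("16-page alloc returned %ld\n", alloc(16));
    return 0;
}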

Can you try this, just to confirm that this is what we're seeing?

1. In iommu_setparms_lpar() in arch/powerpc/platforms/pseries/iommu.c,
can you try changing it_blocksize from 16 to 1?
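
That is, something along these lines (from memory, the exact spot and
surrounding code may differ in your tree):

    tbl->it_blocksize = 1;  /* was 16: one entry instead of a whole 128-byte line */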


Thanks,

Olof

2006-01-27 22:39:58

by Mark Haverkamp

Subject: Re: iommu_alloc failure and panic

On Sat, 2006-01-28 at 09:40 +1300, Olof Johansson wrote:
> Hi,
>
> On Fri, Jan 27, 2006 at 08:57:40AM -0800, Mark Haverkamp wrote:
> > While running a disk test (bonnie++) I have been seeing iommu_alloc
> > failed messages in the syslog leading to a panic. The machine that I
>
> Hmm. The IOMMU code tries to be clever and not write on a cacheline
> that's already in use, to avoid invalidating cached entries on the I/O
> bridges.
>
> Since a cache line is 128 bytes (16 entries), this means it advances
> that much on each allocation, and tries to allocate a new entry in the
> next block of 16. Essentially it is fragmenting the allocation space on
> purpose.
>
> So far I haven't seen anyone else have problems with this. I'm
> suspecting that the SCSI probe code might map something per disk (or
> similar), such that there are a lot of small allocations being done, each
> using up part of a line. Once the end of the allocation space is
> reached, the allocator wraps to the beginning and starts walking again.
>
> Since this greatly reduces the chance of allocating anything 16 pages or
> more, a part of the table (25%) is set aside and used for the large
> allocations. If an allocation in one section of the table (large/small)
> doesn't succeed, the other half is also searched.
>
> Each allocation in your report is 10 entries, so that means that at
> least 7 must be taken on each line, since the allocation won't go to the
> large area by default.
>
> Can you try this, just to confirm that this is what we're seeing?
>
> 1. In iommu_setparms_lpar() in arch/powerpc/platforms/pseries/iommu.c,
> can you try changing it_blocksize from 16 to 1?

I tried this but I still see the iommu_alloc failures:

Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000133b00000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000016e9ef000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000133b00000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000016e9ef000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000f7970000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000013304f000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000016eb4a000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000017105b000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000016ec21000 npages 10
Jan 27 14:19:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000b96c2000 npages 10
Jan 27 14:19:48 linux kernel: printk: 415 messages suppressed.
Jan 27 14:19:48 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000016d350000 npages 10
Jan 27 14:19:53 linux kernel: printk: 430 messages suppressed.
Jan 27 14:19:53 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000010a940000 npages 10
Jan 27 14:19:56 linux kernel: printk: 481 messages suppressed.
Jan 27 14:19:56 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000a74d0000 npages 10
Jan 27 14:20:03 linux kernel: printk: 530 messages suppressed.
Jan 27 14:20:03 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000016036a000 npages 10
Jan 27 14:20:07 linux kernel: printk: 510 messages suppressed.
Jan 27 14:20:07 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000971cc000 npages 10
Jan 27 14:20:11 linux kernel: printk: 482 messages suppressed.
Jan 27 14:20:11 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000167cae000 npages 10
Jan 27 14:20:16 linux kernel: printk: 381 messages suppressed.
Jan 27 14:20:16 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000011d384000 npages 10
Jan 27 14:20:21 linux kernel: printk: 776 messages suppressed.
Jan 27 14:20:21 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000015c942000 npages 10
Jan 27 14:20:26 linux kernel: printk: 643 messages suppressed.
Jan 27 14:20:26 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000148e22000 npages 10
Jan 27 14:20:31 linux kernel: printk: 455 messages suppressed.
Jan 27 14:20:31 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000128527000 npages 10
Jan 27 14:20:48 linux kernel: printk: 542 messages suppressed.
Jan 27 14:20:48 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000002f380000 npages 10
Jan 27 14:20:49 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000006ffad000 npages 10
Jan 27 14:20:49 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c00000006ffad000 npages 10
Jan 27 14:20:51 linux kernel: printk: 359 messages suppressed.
Jan 27 14:20:51 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c000000068db1000 npages 10

I would have thought that the npages would be 1 now.


Here is the console output.


Welcome to SUSE LINUX 10.0 (PPC) - Kernel 2.6.16-rc1-ppc64 (hvc0).


dev2-002 login: Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=128 NUMA PSERIES LPAR
Modules linked in: dm_round_robin dm_multipath lpfc scsi_transport_fc ipv6 parport_pc lp parport dm_mod sg st ipr firmware_class sd_mod scsi_mod
NIP: C00000000000F7D0 LR: C00000000000FA2C CTR: 0000000000000000
REGS: c0000001e7bb3940 TRAP: 0300 Not tainted (2.6.16-rc1-ppc64)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 24002048 XER: 00000018
DAR: C000000600711718, DSISR: 0000000040010000
TASK = c00000000764f040[179] 'kblockd/0' THREAD: c0000001e7bb0000 CPU: 0
GPR00: 0000000600000008 C0000001E7BB3BC0 C00000000070EE68 C0000001E7BB3DE0
GPR04: C00000000764F040 00000000000002F0 0000000000080000 C00000000058C720
GPR08: 0000000000000001 C0000001E7BB3D10 C000000000711710 C0000001E7BB0000
GPR12: C0000001E7BB3720 C00000000055EB00 0000000000000000 0000000000000000
GPR16: 0000000000000000 000000000197FD90 4000000001C00000 000000000197FD88
GPR20: 000000000197FD98 C000000000503290 0000000000000060 0000000002103728
GPR24: C00000000055EB00 0000000000000001 0000000000000000 C0000001E7BB3EE0
GPR28: 0000000000000002 C0000001E7BB3DE0 C00000000764F040 C000000000079FC0
NIP [C00000000000F7D0] .validate_sp+0x30/0xc0
LR [C00000000000FA2C] .show_stack+0xec/0x1d0
Call Trace:
[C0000001E7BB3BC0] [C00000000000FA18] .show_stack+0xd8/0x1d0 (unreliable)
[C0000001E7BB3C60] [C000000000433838] .schedule+0xd78/0xfb0
[C0000001E7BB3DE0] [C000000000079FC0] .worker_thread+0x1b0/0x1c0
Unable to handle kernel paging request for data at address 0xc000000600711718
Faulting instruction address: 0xc00000000000f7d0
Unable to handle kernel paging request for data at address 0xc00001800055ed30
Faulting instruction address: 0xc000000000056ec4


--
Mark Haverkamp <[email protected]>

2006-01-27 23:35:10

by Olof Johansson

Subject: Re: iommu_alloc failure and panic

On Fri, Jan 27, 2006 at 02:39:50PM -0800, Mark Haverkamp wrote:

> I would have thought that the npages would be 1 now.

No, npages is the size of the allocation coming from the driver, that
won't change. The table blocksize just says how wide the cache line
is, i.e. how far it should advance between allocations.

This is a patch that should probably have been added a while ago, to
give a bit more info. Can you apply it and give it a go?


Thanks,

Olof




Index: 2.6/arch/powerpc/kernel/iommu.c
===================================================================
--- 2.6.orig/arch/powerpc/kernel/iommu.c 2006-01-28 12:18:56.000000000 +1300
+++ 2.6/arch/powerpc/kernel/iommu.c 2006-01-28 12:31:20.000000000 +1300
@@ -146,6 +146,8 @@ static unsigned long iommu_range_alloc(s
if (handle)
*handle = end;

+ tbl->it_used += npages;
+
return n;
}

@@ -214,6 +216,8 @@ static void __iommu_free(struct iommu_ta

for (i = 0; i < npages; i++)
__clear_bit(free_entry+i, tbl->it_map);
+
+ tbl->it_used -= npages;
}

static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
@@ -281,9 +285,11 @@ int iommu_map_sg(struct device *dev, str

/* Handle failure */
if (unlikely(entry == DMA_ERROR_CODE)) {
- if (printk_ratelimit())
- printk(KERN_INFO "iommu_alloc failed, tbl %p vaddr %lx"
+ if (printk_ratelimit()) {
+ printk(KERN_INFO "iommu_alloc_sg failed, tbl %p vaddr %lx"
" npages %lx\n", tbl, vaddr, npages);
+ printk(KERN_INFO "table size %lx used %lx\n", tbl->it_size, tbl->it_used);
+ }
goto failure;
}

@@ -422,6 +428,7 @@ struct iommu_table *iommu_init_table(str

tbl->it_hint = 0;
tbl->it_largehint = tbl->it_halfpoint;
+ tbl->it_used = 0;
spin_lock_init(&tbl->it_lock);

/* Clear the hardware table in case firmware left allocations in it */
@@ -496,6 +503,7 @@ dma_addr_t iommu_map_single(struct iommu
printk(KERN_INFO "iommu_alloc failed, "
"tbl %p vaddr %p npages %d\n",
tbl, vaddr, npages);
+ printk(KERN_INFO "table size %lx used %lx\n", tbl->it_size, tbl->it_used);
}
} else
dma_handle |= (uaddr & ~PAGE_MASK);
Index: 2.6/include/asm-powerpc/iommu.h
===================================================================
--- 2.6.orig/include/asm-powerpc/iommu.h 2006-01-21 03:14:30.000000000 +1300
+++ 2.6/include/asm-powerpc/iommu.h 2006-01-28 12:26:39.000000000 +1300
@@ -47,6 +47,7 @@ struct iommu_table {
unsigned long it_largehint; /* Hint for large allocs */
unsigned long it_halfpoint; /* Breaking point for small/large allocs */
spinlock_t it_lock; /* Protects it_map */
+ unsigned long it_used;
unsigned long *it_map; /* A simple allocation bitmap for now */
};

2006-01-27 23:37:40

by Mark Haverkamp

Subject: Re: iommu_alloc failure and panic

On Sat, 2006-01-28 at 12:34 +1300, Olof Johansson wrote:
> On Fri, Jan 27, 2006 at 02:39:50PM -0800, Mark Haverkamp wrote:
>
> > I would have thought that the npages would be 1 now.
>
> No, npages is the size of the allocation coming from the driver, that
> won't change. The table blocksize just says how wide the cache line
> is, i.e. how far it should advance between allocations.
>
> This is a patch that should probably have been added a while ago, to
> give a bit more info. Can you apply it and give it a go?

OK, I'll try it and let you know.

Thanks,
Mark.
--
Mark Haverkamp <[email protected]>

2006-01-27 23:50:29

by Anton Blanchard

Subject: Re: iommu_alloc failure and panic


Hi,

> I have been testing large numbers of storage devices. I have 16000 SCSI
> LUNs configured (4000 Fibre Channel LUNs seen 4 times). They are
> configured as 4000 multipath devices using device mapper. I have 100 of
> those devices configured as a logical volume using LVM2. Once I start
> bonnie++ using one of those logical volumes I start seeing iommu_alloc
> failures and then a panic. I don't have this issue when accessing the
> individual devices or the individual multipath devices, only when they
> are conglomerated into a logical volume.
>
> Here is the syslog output:
>
> Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000d7967000 npages 10

This stuff should be OK since the lpfc driver should handle the iommu
filling up. I'm guessing since you have so many LUNs you can get a whole
lot of outstanding IO, enough to fill up the TCE table.
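
Rough numbers, assuming the usual 4kB TCE page size, the default 256MB
(0x10000 entry) window and reading npages as hex:

    0x10 pages * 4kB      = 64kB per failing mapping
    0x10000 entries * 4kB = 256MB of DMA window
    65536 / 16            = 4096 such mappings fill the whole window

so with thousands of LUNs it doesn't take much concurrent IO to get there.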

> DAR: C000000600711710

> NIP [C00000000000F7D0] .validate_sp+0x30/0xc0
> LR [C00000000000FA2C] .show_stack+0xec/0x1d0
> Call Trace:
> [C0000000076DBBC0] [C00000000000FA18] .show_stack+0xd8/0x1d0 (unreliable)
> [C0000000076DBC60] [C000000000433838] .schedule+0xd78/0xfb0
> [C0000000076DBDE0] [C000000000079FC0] .worker_thread+0x1b0/0x1c0

This is interesting, an oops in validate_sp which is a pretty boring
function. We have seen this in the past when you overflow your kernel
stack. The bottom of the kernel stack contains the thread_info struct and
in that we have the task_cpu() data.

When we overflow the stack and overwrite the thread_info we end up with
a crazy number for task_cpu() and then oops when doing
hardirq_ctx[task_cpu(p)]. Can you turn on DEBUG_STACKOVERFLOW and
DEBUG_STACK_USAGE and see if you get any stack overflow warnings? The
second option will also allow you to do sysrq t and dump the most stack
each process has used at that point.
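
Here is a toy user-space model of that corruption (the real thread_info has
more fields and the numbers are made up, but the layout idea is the same):

/* stack_smash.c - why a ppc64 kernel stack overflow shows up as a bogus
 * task_cpu() and a wild hardirq_ctx[] pointer inside validate_sp() */
#include <stdio.h>
#include <string.h>

#define THREAD_SIZE 16384        /* ppc64 kernel stacks are 16kB */
#define NR_CPUS     128

struct thread_info { unsigned int cpu; /* ...flags, preempt_count... */ };

union kstack {
    struct thread_info ti;       /* lives at the lowest addresses */
    unsigned char space[THREAD_SIZE];
};

static void *hardirq_ctx[NR_CPUS]; /* stand-in for the per-cpu irq stacks */

int main(void)
{
    union kstack s = { .ti = { .cpu = 3 } };

    /* a too-deep call chain writes below the legal bottom of the stack,
     * i.e. straight over thread_info */
    memset(s.space, 0x5a, sizeof(struct thread_info) + 2048);

    printf("task_cpu() now reads %u (NR_CPUS is %d)\n", s.ti.cpu, NR_CPUS);
    printf("hardirq_ctx[task_cpu(p)] would be at %#lx, far outside the array\n",
           (unsigned long)hardirq_ctx + (unsigned long)s.ti.cpu * sizeof(void *));
    return 0;
}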

Anton

2006-01-28 00:00:54

by Mark Haverkamp

Subject: Re: iommu_alloc failure and panic

On Sat, 2006-01-28 at 10:48 +1100, Anton Blanchard wrote:
> Hi,
>
> > I have been testing large numbers of storage devices. I have 16000 SCSI
> > LUNs configured (4000 Fibre Channel LUNs seen 4 times). They are
> > configured as 4000 multipath devices using device mapper. I have 100 of
> > those devices configured as a logical volume using LVM2. Once I start
> > bonnie++ using one of those logical volumes I start seeing iommu_alloc
> > failures and then a panic. I don't have this issue when accessing the
> > individual devices or the individual multipath devices, only when they
> > are conglomerated into a logical volume.
> >
> > Here is the syslog output:
> >
> > Jan 26 14:56:41 linux kernel: iommu_alloc failed, tbl c00000000263c480 vaddr c0000000d7967000 npages 10
>
> This stuff should be OK since the lpfc driver should handle the iommu
> filling up. I'm guessing since you have so many LUNs you can get a whole
> lot of outstanding IO, enough to fill up the TCE table.
>
> > DAR: C000000600711710
>
> > NIP [C00000000000F7D0] .validate_sp+0x30/0xc0
> > LR [C00000000000FA2C] .show_stack+0xec/0x1d0
> > Call Trace:
> > [C0000000076DBBC0] [C00000000000FA18] .show_stack+0xd8/0x1d0 (unreliable)
> > [C0000000076DBC60] [C000000000433838] .schedule+0xd78/0xfb0
> > [C0000000076DBDE0] [C000000000079FC0] .worker_thread+0x1b0/0x1c0
>
> This is interesting, an oops in validate_sp which is a pretty boring
> function. We have seen this in the past when you overflow your kernel
> stack. The bottom of the kernel stack contains the thread_info struct and
> in that we have the task_cpu() data.
>
> When we overflow the stack and overwrite the thread_info we end up with
> a crazy number for task_cpu() and then oops when doing
> hardirq_ctx[task_cpu(p)]. Can you turn on DEBUG_STACKOVERFLOW and
> DEBUG_STACK_USAGE and see if you get any stack overflow warnings?

It looks like they are already on:

CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACK_USAGE=y

> The
> second option will also allow you to do sysrq t and dump the most stack
> each process has used at that point.

The machine is frozen after the panic.


Mark.


>
> Anton
--
Mark Haverkamp <[email protected]>

2006-01-29 09:11:30

by Anton Blanchard

Subject: Re: iommu_alloc failure and panic


Hi,

> It looks like they are already on:
>
> CONFIG_DEBUG_STACKOVERFLOW=y
> CONFIG_DEBUG_STACK_USAGE=y

You could try bumping the limit for the warning in
arch/powerpc/kernel/irq.c

if (unlikely(sp < (sizeof(struct thread_info) + 2048)))

Change it to 4096.
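
From memory the check in do_IRQ() looks roughly like this, so after the
change (exact code may differ in your tree):

#ifdef CONFIG_DEBUG_STACKOVERFLOW
    /* warn when less than 4kB of stack is left, instead of 2kB */
    {
        long sp = __get_SP() & (THREAD_SIZE - 1);

        if (unlikely(sp < (sizeof(struct thread_info) + 4096))) {
            printk("do_IRQ: stack overflow: %ld\n",
                   sp - sizeof(struct thread_info));
            dump_stack();
        }
    }
#endif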

Unfortunately if the long call chain is in an interrupt disabled area
you will not get warned.

> The machine is frozen after the panic.

You can do it sometime before the hang to see if you are getting close
to overflowing the stack.

Anton

2006-01-30 15:33:12

by Mark Haverkamp

Subject: Re: iommu_alloc failure and panic

On Sat, 2006-01-28 at 12:34 +1300, Olof Johansson wrote:
> On Fri, Jan 27, 2006 at 02:39:50PM -0800, Mark Haverkamp wrote:
>
> > I would have thought that the npages would be 1 now.
>
> No, npages is the size of the allocation coming from the driver, that
> won't change. The table blocksize just says how wide the cache line
> is, i.e. how far it should advance between allocations.
>
> This is a patch that should probably have been added a while ago, to
> give a bit more info. Can you apply it and give it a go?
>
>
> Thanks,
>
> Olof
>


Here are the last few lines of the log before it crashed.


Jan 30 07:29:14 linux kernel: table size 10000 used f752
Jan 30 07:29:21 linux kernel: printk: 650 messages suppressed.
Jan 30 07:29:21 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c000000059424000 npages 10
Jan 30 07:29:21 linux kernel: table size 10000 used f2c9
Jan 30 07:29:25 linux kernel: printk: 590 messages suppressed.
Jan 30 07:29:25 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c00000007c5f9000 npages e
Jan 30 07:29:25 linux kernel: table size 10000 used f6ae
Jan 30 07:29:29 linux kernel: printk: 705 messages suppressed.
Jan 30 07:29:29 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c0000000d3c9f000 npages e
Jan 30 07:29:29 linux kernel: table size 10000 used f530
Jan 30 07:29:35 linux kernel: printk: 357 messages suppressed.
Jan 30 07:29:35 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c00000010156b000 npages 10
Jan 30 07:29:35 linux kernel: table size 10000 used e69e
Jan 30 07:29:39 linux kernel: printk: 481 messages suppressed.
Jan 30 07:29:39 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c00000007616b000 npages 10
Jan 30 07:29:39 linux kernel: table size 10000 used f8e2
Jan 30 07:30:02 linux kernel: printk: 366 messages suppressed.
Jan 30 07:30:02 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c0000000b0f80000 npages 10
Jan 30 07:30:02 linux kernel: table size 10000 used fbf2
Jan 30 07:30:02 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c0000000b43d3000 npages 10
Jan 30 07:30:02 linux kernel: table size 10000 used fc00
Jan 30 07:30:02 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c0000000ac267000 npages 10
Jan 30 07:30:02 linux kernel: table size 10000 used fbf1
Jan 30 07:30:02 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c0000000b7380000 npages 10
Jan 30 07:30:02 linux kernel: table size 10000 used fc66
Jan 30 07:30:04 linux kernel: printk: 417 messages suppressed.
Jan 30 07:30:04 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c0000000d94be000 npages 10
Jan 30 07:30:04 linux kernel: table size 10000 used ffd2
Jan 30 07:30:09 linux kernel: printk: 957 messages suppressed.
Jan 30 07:30:09 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c0000000e59c0000 npages e
Jan 30 07:30:09 linux kernel: table size 10000 used fd18
Jan 30 07:30:14 linux kernel: printk: 990 messages suppressed.
Jan 30 07:30:14 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c0000001019c3000 npages 10
Jan 30 07:30:14 linux kernel: table size 10000 used fa2b
Jan 30 07:30:19 linux kernel: printk: 881 messages suppressed.
Jan 30 07:30:19 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c000000066901000 npages 10
Jan 30 07:30:19 linux kernel: table size 10000 used ee58
Jan 30 07:30:24 linux kernel: printk: 1094 messages suppressed.
Jan 30 07:30:24 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c00000011d110000 npages 10
Jan 30 07:30:24 linux kernel: table size 10000 used f74a
Jan 30 07:30:29 linux kernel: printk: 890 messages suppressed.
Jan 30 07:30:29 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c00000007db1c000 npages 10
Jan 30 07:30:29 linux kernel: table size 10000 used f6e3
Jan 30 07:30:34 linux kernel: printk: 1020 messages suppressed.
Jan 30 07:30:34 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c00000004759f000 npages 10
Jan 30 07:30:34 linux kernel: table size 10000 used f972
Jan 30 07:30:39 linux kernel: printk: 1209 messages suppressed.
Jan 30 07:30:39 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c00000003a39b000 npages 10
Jan 30 07:30:39 linux kernel: table size 10000 used e358
Jan 30 07:30:44 linux kernel: printk: 608 messages suppressed.
Jan 30 07:30:44 linux kernel: iommu_alloc_sg failed, tbl c0000001e794dd80 vaddr c00000000b086000 npages 10
Jan 30 07:30:44 linux kernel: table size 10000 used ff52


--
Mark Haverkamp <[email protected]>

2006-01-30 23:15:29

by Olof Johansson

Subject: Re: iommu_alloc failure and panic

On Mon, Jan 30, 2006 at 07:32:58AM -0800, Mark Haverkamp wrote:
> On Sat, 2006-01-28 at 12:34 +1300, Olof Johansson wrote:
> > On Fri, Jan 27, 2006 at 02:39:50PM -0800, Mark Haverkamp wrote:
> >
> > > I would have thought that the npages would be 1 now.
> >
> > No, npages is the size of the allocation coming from the driver, that
> > won't change. The table blocksize just says how wide the cache line
> > is, i.e. how far it should advance between allocations.
> >
> > This is a patch that should probably have been added a while ago, to
> > give a bit more info. Can you apply it and give it a go?
> >
> >
> > Thanks,
> >
> > Olof
> >
>
>
> Here are the last few lines of the log before it crashed.
>
>
> Jan 30 07:29:14 linux kernel: table size 10000 used f752

Ok, that's a 256MB table, which is standard, and it seems to have been
filled with mappings. In some cases there are a few entries left but it's
likely that fragmentation causes the 10-entry alloc to fail, quite
normal.
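
For reference, the arithmetic behind that (assuming the usual 4kB TCE page
size):

    table size 0x10000 = 65536 entries, 65536 * 4kB = 256MB of DMA window
    "used f752"        = 63314 entries, i.e. roughly 247MB already mapped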

There are two things to look at; unfortunately I fall short on both of
them myself:

1) There's a way to get more than the default 256MB DMA window for a PCI
slot. I'm not aware of the exact details, but you need recent firmware
and you configure it in the ASM menus (the web interface for the
service processor). Cc:ing Jake Moilanen in case he has any more up to
date info.

2) The emulex driver has been prone to problems in the past where it's
been very aggressive at starting DMA operations, and I think it can
be avoided with tuning. What I don't know is if it's because of this,
or simply because of the large number of targets you have. Cc:ing James
Smart.


-Olof

2006-01-31 01:19:10

by Doug Maxey

Subject: Re: iommu_alloc failure and panic


On Tue, 31 Jan 2006 10:13:48 +1100, Olof Johansson wrote:
>On Mon, Jan 30, 2006 at 07:32:58AM -0800, Mark Haverkamp wrote:
>> On Sat, 2006-01-28 at 12:34 +1300, Olof Johansson wrote:
>> > On Fri, Jan 27, 2006 at 02:39:50PM -0800, Mark Haverkamp wrote:
>> >
>> > > I would have thought that the npages would be 1 now.
>> >
>> > No, npages is the size of the allocation coming from the driver, that
>> > won't change. The table blocksize just says how wide the cache line
>> > is, i.e. how far it should advance between allocations.
>> >
>> > This is a patch that should probably have been added a while ago, to
>> > give a bit more info. Can you apply it and give it a go?
>> >
>> >
>> > Thanks,
>> >
>> > Olof
>> >
>>
>>
>> Here are the last few lines of the log before it crashed.
>>
>>
>> Jan 30 07:29:14 linux kernel: table size 10000 used f752
>
>Ok, that's a 256MB table, which is standard, and it seems to have been
>filled with mappings. In some cases there are a few entries left but it's
>likely that fragmentation causes the 10-entry alloc to fail, quite
>normal.
>
>There are two things to look at; unfortunately I fall short on both of
>them myself:
>
>1) There's a way to get more than the default 256MB DMA window for a PCI
>slot. I'm not aware of the exact details, but you need recent firmware
>and you configure it in the ASM menus (the web interface for the
>service processor). Cc:ing Jake Moilanen in case he has any more up to
>date info.
>

ASM > System Configuration > I/O Adapter Enlarged Capacity.

This only applies to the last slot on a PHB, a.k.a. the superslot.


>2) The emulex driver has been prone to problems in the past where it's
>been very aggressive at starting DMA operations, and I think it can
>be avoided with tuning. What I don't know is if it's because of this,
>or simply because of the large number of targets you have. Cc:ing James
>Smart.
>
>
>-Olof


2006-01-31 13:42:53

by James Smart

Subject: Re: iommu_alloc failure and panic


>> 2) The emulex driver has been prone to problems in the past where it's
>> been very aggressive at starting DMA operations, and I think it can
>> be avoided with tuning. What I don't know is if it's because of this,
>> or simply because of the large number of targets you have. Cc:ing James
>> Smart.

I don't have data points for the 2.6 kernel, but I can comment on what I
have seen on the 2.4 kernel.

The issue that I saw on the 2.4 kernel was that the pci dma alloc routine
was inappropriately allocating from the dma s/g maps. On systems with less
than 4Gig of memory, or on those with no iommu (EM64T), the checks around
adapter-supported dma masks were off (I'm going to be loose with terms and not
describe it in detail). The result was, although the adapter could support
a full 64-bit address and/or although the physical dma address would be under
32-bits, the logic forced allocation from the mapped dma pool. On some
systems, this pool was originally only 16MB. Around 2.4.30, the swiotlb was
introduced, which reduced the issue but unfortunately still never fixed the
allocation logic. It failed less because the swiotlb simply had more space.
As far as I know, this problem doesn't exist in the 2.6 kernel. I'd have to
go look at the dma map functions to make sure.

Why was the lpfc driver prone to the dma map exhaustion failures? Due to the
default # of commands per lun and max sg segments reported by the driver to
the scsi midlayer, the scsi mid-layer's preallocation of dma maps for commands
for each lun, and the fact that our FC configs were usually large, had lots
of luns, and replicated the resources for each path to the same storage.

Ultimately, what I think is the real issue here is the way the scsi mid-layer
is preallocating dma maps for the luns. 16000 luns is a huge number.
Multiply this by a max sg segment count of 64 by the driver, and a number
between 3 and 30 commands per lun, and you can see the numbers. Scsi does do
some interesting allocation algorithms once it hits an allocation failure.
One side effect of this is that it is fairly efficient at allocating the
bulk of the dma pool.
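
Just to put rough numbers on the scale (using the figures above and ignoring
the per-command overhead details):

    16000 luns * 3 commands/lun (the low end) * 64 sg segments
        = roughly 3 million potential mappings

against a 65536-entry TCE table, so only a small fraction of that needs to be
in flight at once to exhaust the window.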

-- james s

2006-01-31 21:33:40

by Mark Haverkamp

Subject: Re: iommu_alloc failure and panic

On Tue, 2006-01-31 at 08:41 -0500, James Smart wrote:
> >> 2) The emulex driver has been prone to problems in the past where it's
> >> been very aggressive at starting DMA operations, and I think it can
> >> be avoided with tuning. What I don't know is if it's because of this,
> >> or simply because of the large number of targets you have. Cc:ing James
> >> Smart.
>
> I don't have data points for the 2.6 kernel, but I can comment on what I
> have seen on the 2.4 kernel.
>
> The issue that I saw on the 2.4 kernel was that the pci dma alloc routine
> was inappropriately allocating from the dma s/g maps. On systems with less
> than 4Gig of memory, or on those with no iommu (EM64T), the checks around
> adapter-supported dma masks were off (I'm going to be loose with terms and not
> describe it in detail). The result was, although the adapter could support
> a full 64-bit address and/or although the physical dma address would be under
> 32-bits, the logic forced allocation from the mapped dma pool. On some
> systems, this pool was originally only 16MB. Around 2.4.30, the swiotlb was
> introduced, which reduced the issue but unfortunately still never fixed the
> allocation logic. It failed less because the swiotlb simply had more space.
> As far as I know, this problem doesn't exist in the 2.6 kernel. I'd have to
> go look at the dma map functions to make sure.
>
> Why was the lpfc driver prone to the dma map exhaustion failures? Due to the
> default # of commands per lun and max sg segments reported by the driver to
> the scsi midlayer, the scsi mid-layer's preallocation of dma maps for commands
> for each lun, and the fact that our FC configs were usually large, had lots
> of luns, and replicated the resources for each path to the same storage.
>
> Ultimately, what I think is the real issue here is the way the scsi mid-layer
> is preallocating dma maps for the luns. 16000 luns is a huge number.
> Multiply this by a max sg segment count of 64 by the driver, and a number
> between 3 and 30 commands per lun, and you can see the numbers. Scsi does do
> some interesting allocation algorithms once it hits an allocation failure.
> One side effect of this is that it is fairly efficient at allocating the
> bulk of the dma pool.

James,

Thanks for the information. I tried loading the lpfc driver with
lpfc_lun_queue_depth=1 and haven't seen iommu_alloc failures. I'm still
curious why the alloc failures lead to a panic though.

Mark.


>
> -- james s
--
Mark Haverkamp <[email protected]>