2006-01-27 17:28:32

by Stefan Kaltenbrunner

Subject: qla2xxx related oops in 2.6.16-rc1

We hit the following oops in 2.6.16-rc1 during testing of a
devicemapper based multipath infrastructure.

The oops happened during heavy I/O on the devicemapper device and a reboot
of one of the switches the host was directly connected to.

The host in question is a dual Opteron 280 with 16GB RAM and two
qla2340 adapters accessing an IBM DS4300 array.

Stefan

Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
<ffffffff803cc6c6>{_spin_lock+0}
PGD 3ff513067 PUD 3ff514067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: dm_round_robin dm_multipath dm_mod i2c_amd756 qla2300
qla2xxx i2c_core evdev
Pid: 2568, comm: qla2300_1_dpc Not tainted 2.6.16-rc1 #4
RIP: 0010:[<ffffffff803cc6c6>] <ffffffff803cc6c6>{_spin_lock+0}
RSP: 0018:ffff8101ffbb1d70 EFLAGS: 00010286
RAX: ffffffff804c6cc8 RBX: ffff8101fea11c78 RCX: ffffffffffffffd8
RDX: ffffffffffffffd8 RSI: ffffffff803ca8e1 RDI: 0000000000000000
RBP: ffff8101fea54160 R08: ffff8101ffbb0000 R09: 000000000000000a
R10: 000000000000000a R11: ffff8103ffd18800 R12: 0000000000000000
R13: 0000000000000000 R14: ffffffff880288c8 R15: 0000000000509c10
FS: 00002b031d5f0640(0000) GS:ffffffff80571000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000003ff512000 CR4: 00000000000006e0
Process qla2300_1_dpc (pid: 2568, threadinfo ffff8101ffbb0000, task
ffff8101ff6311e0)
Stack: ffffffff803ca95a ffffffff804c6cc8 ffff8101fea11c50 ffff8101fea54000
ffffffff802bd255 000000000000000a ffff8101fea11c50 ffff8101fea11c00
ffffffff8031acf5 ffff8101ffcbd600
Call Trace: <ffffffff803ca95a>{klist_del+18}
<ffffffff802bd255>{device_del+28}
<ffffffff8031acf5>{fc_rport_terminate+81}
<ffffffff8800f3bb>{:qla2xxx:qla2x00_reg_remote_port+28}
<ffffffff88010001>{:qla2xxx:qla2x00_fabric_dev_login+111}
<ffffffff8800f749>{:qla2xxx:qla2x00_configure_fabric+503}
<ffffffff8800efa4>{:qla2xxx:qla2x00_configure_loop+283}
<ffffffff8801020d>{:qla2xxx:qla2x00_loop_resync+95}
<ffffffff8800c8bb>{:qla2xxx:qla2x00_do_dpc+655}
<ffffffff8010b726>{child_rip+8}
<ffffffff8800c62c>{:qla2xxx:qla2x00_do_dpc+0}
<ffffffff8010b71e>{child_rip+0}

Code: f0 ff 0f 0f 88 14 01 00 00 c3 48 89 f8 f0 81 28 00 00 00 01
RIP <ffffffff803cc6c6>{_spin_lock+0} RSP <ffff8101ffbb1d70>
CR2: 0000000000000000
<6>qla2300 0000:05:08.0: scsi(0:4:0): Abort command issued -- 31e4c 2002.
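
A note for readers decoding the trace above (not part of the original report): the "Oops: 0002" line carries the x86 page-fault error code. The following is a minimal sketch of how that value can be read, assuming the standard x86 error-code bit layout; the helper name decode_oops_code is hypothetical and used only for illustration.

```python
# Sketch: decode the hexadecimal page-fault error code from an "Oops:" line.
# Assumption: standard x86 page-fault error-code bits (P, W/R, U/S, RSVD, I/D).
PF_BITS = [
    # (mask, meaning if set, meaning if clear or None to stay silent)
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_text):
    """Return a list of human-readable properties for an oops error code."""
    code = int(code_text, 16)  # the kernel prints this field in hex
    properties = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            properties.append(if_set)
        elif if_clear is not None:
            properties.append(if_clear)
    return properties


# The value reported in this thread.
print(decode_oops_code("0002"))
```

Read this way, the code describes a kernel-mode write to a page that is not present. Together with CR2 and RDI both being 0 and the fault sitting at _spin_lock+0, that is consistent with klist_del handing a NULL lock pointer to _spin_lock, although the trace alone does not show why the pointer was NULL.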


2006-01-30 15:34:39

by Andrew Vasquez

Subject: Re: qla2xxx related oops in 2.6.16-rc1

On Fri, 27 Jan 2006, Stefan Kaltenbrunner wrote:

> We hit the following oops in 2.6.16-rc1 during testing of a
> devicemapper based multipath infrastructure.
>
> The oops happened during heavy I/O on the devicemapper device and a reboot
> of one of the switches the host was directly connected to.
>
> The host in question is a dual Opteron 280 with 16GB RAM and two
> qla2340 adapters accessing an IBM DS4300 array.
>
> Stefan
>
> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> <ffffffff803cc6c6>{_spin_lock+0}
> PGD 3ff513067 PUD 3ff514067 PMD 0
> Oops: 0002 [1] SMP
> CPU 0
> Modules linked in: dm_round_robin dm_multipath dm_mod i2c_amd756 qla2300
> qla2xxx i2c_core evdev
> Pid: 2568, comm: qla2300_1_dpc Not tainted 2.6.16-rc1 #4
> RIP: 0010:[<ffffffff803cc6c6>] <ffffffff803cc6c6>{_spin_lock+0}

Could you retry your tests with the following patchset:

http://marc.theaimsgroup.com/?l=linux-scsi&m=113779768321616&w=2
http://marc.theaimsgroup.com/?l=linux-scsi&m=113779768230038&w=2
http://marc.theaimsgroup.com/?l=linux-scsi&m=113779768230735&w=2

They will apply cleanly to a 2.6.16-rc1 tree.

Regards,
Andrew Vasquez

2006-01-30 20:24:28

by Olaf Hering

Subject: Re: qla2xxx related oops in 2.6.16-rc1

On Mon, Jan 30, Andrew Vasquez wrote:

> On Fri, 27 Jan 2006, Stefan Kaltenbrunner wrote:
>
> > We hit the following oops in 2.6.16-rc1 during testing of a
> > devicemapper based multipath infrastructure.
> >
> > The oops happened during heavy I/O on the devicemapper device and a reboot
> > of one of the switches the host was directly connected to.
> >
> > The host in question is a dual Opteron 280 with 16GB RAM and two
> > qla2340 adapters accessing an IBM DS4300 array.
> >
> > Stefan
> >
> > Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> > <ffffffff803cc6c6>{_spin_lock+0}
> > PGD 3ff513067 PUD 3ff514067 PMD 0
> > Oops: 0002 [1] SMP
> > CPU 0
> > Modules linked in: dm_round_robin dm_multipath dm_mod i2c_amd756 qla2300
> > qla2xxx i2c_core evdev
> > Pid: 2568, comm: qla2300_1_dpc Not tainted 2.6.16-rc1 #4
> > RIP: 0010:[<ffffffff803cc6c6>] <ffffffff803cc6c6>{_spin_lock+0}
>
> Could you retry your tests with the following patchset:

This is a generic bug. I hit it as well several times during my testing of
https://bugzilla.novell.com/show_bug.cgi?id=145459

I have no idea whether my slab corruption and this oops have the same cause.

--
short story of a lazy sysadmin:
alias appserv=wotan

2006-01-31 10:07:16

by Olaf Hering

Subject: Re: qla2xxx related oops in 2.6.16-rc1

On Mon, Jan 30, Andrew Vasquez wrote:

> On Fri, 27 Jan 2006, Stefan Kaltenbrunner wrote:
>
> > We hit the following oops in 2.6.16-rc1 during testing of a
> > devicemapper based multipath infrastructure.
> >
> > The oops happened during heavy I/O on the devicemapper device and a reboot
> > of one of the switches the host was directly connected to.
> >
> > The host in question is a dual Opteron 280 with 16GB RAM and two
> > qla2340 adapters accessing an IBM DS4300 array.
> >
> > Stefan
> >
> > Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> > <ffffffff803cc6c6>{_spin_lock+0}
> > PGD 3ff513067 PUD 3ff514067 PMD 0
> > Oops: 0002 [1] SMP
> > CPU 0
> > Modules linked in: dm_round_robin dm_multipath dm_mod i2c_amd756 qla2300
> > qla2xxx i2c_core evdev
> > Pid: 2568, comm: qla2300_1_dpc Not tainted 2.6.16-rc1 #4
> > RIP: 0010:[<ffffffff803cc6c6>] <ffffffff803cc6c6>{_spin_lock+0}

This one has been happening at least since 58b6c58caef7a34eab7ec887288fa495696653e7

--
short story of a lazy sysadmin:
alias appserv=wotan

2006-01-31 21:23:28

by Stefan Kaltenbrunner

Subject: Re: qla2xxx related oops in 2.6.16-rc1

Olaf Hering wrote:
> On Mon, Jan 30, Andrew Vasquez wrote:
>
>
>>On Fri, 27 Jan 2006, Stefan Kaltenbrunner wrote:
>>
>>
>>>We hit the following oops in 2.6.16-rc1 during testing of a
>>>devicemapper based multipath infrastructure.
>>>
>>>The oops happened during heavy I/O on the devicemapper device and a reboot
>>>of one of the switches the host was directly connected to.
>>>
>>>The host in question is a dual Opteron 280 with 16GB RAM and two
>>>qla2340 adapters accessing an IBM DS4300 array.
>>>
>>>Stefan
>>>
>>>Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
>>><ffffffff803cc6c6>{_spin_lock+0}
>>>PGD 3ff513067 PUD 3ff514067 PMD 0
>>>Oops: 0002 [1] SMP
>>>CPU 0
>>>Modules linked in: dm_round_robin dm_multipath dm_mod i2c_amd756 qla2300
>>>qla2xxx i2c_core evdev
>>>Pid: 2568, comm: qla2300_1_dpc Not tainted 2.6.16-rc1 #4
>>>RIP: 0010:[<ffffffff803cc6c6>] <ffffffff803cc6c6>{_spin_lock+0}
>
>
> This one has been happening at least since 58b6c58caef7a34eab7ec887288fa495696653e7

After applying Andrew's patches I have so far failed to reproduce the
issue again - but I'm not really convinced that it is gone now,
since I could not trigger it very reliably before either ...


Stefan

2006-01-31 21:30:16

by Olaf Hering

Subject: Re: qla2xxx related oops in 2.6.16-rc1

On Tue, Jan 31, Stefan Kaltenbrunner wrote:

> After applying Andrew's patches I have so far failed to reproduce the
> issue again - but I'm not really convinced that it is gone now,
> since I could not trigger it very reliably before either ...

I have hit that several times, and will track it down once my memory
corruption bug is sorted out.

--
short story of a lazy sysadmin:
alias appserv=wotan